Some changes to the ppc32 compilation pipeline, with two aims:
* to achieve code quality comparable with x86/amd64 pipelines
* to make the value flow clearer to memcheck, in the hope of
reducing the very high false error rate it gives on ppc
Code quality is substantially improved, but the error rate is just as
high as it was before. Needs investigation.
Many instructions are now commented out -- mostly they just need
commenting back in. Simple integer programs (date, ls, xfontsel)
work.
Front end changes
~~~~~~~~~~~~~~~~
Change the way CR and XER are represented, and hence redo the way
integer comparisons and conditional branches work:
* Introduce two new IR primops, CmpORD32S and CmpORD32U; these do
ppc-style 3-way comparisons (<, >, ==). It's hard to simulate ppc
efficiently without them. Use these to implement integer compares
(see the sketch after this list).
* Get rid of all thunks for condition codes -- CR and XER state
is always up to date now.
* Split XER into four fields and CR into 16 fields, so that
their various components can be accessed directly without
endless shifting and masking. Create suitable impedance-matching
functions to read/write XER and CR as a whole.
* Use hardware BI numbering throughout.
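For concreteness, here is a minimal C sketch of what the two primops
compute. The 8/4/2 result encoding mirrors the LT/GT/EQ bits of a PPC
CR field; treat the exact result values as an assumption of this
sketch rather than a statement of the real IR definition.

    #include <stdint.h>

    /* ppc-style 3-way compares: exactly one of the LT/GT/EQ bits
       is set in the result. */
    static uint32_t cmpORD32S(int32_t x, int32_t y)
    {
        if (x < y) return 8;   /* LT */
        if (x > y) return 4;   /* GT */
        return 2;              /* EQ */
    }

    static uint32_t cmpORD32U(uint32_t x, uint32_t y)
    {
        if (x < y) return 8;   /* LT */
        if (x > y) return 4;   /* GT */
        return 2;              /* EQ */
    }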
Back end changes
~~~~~~~~~~~~~~~
* Simplify condition code handling and use hardware BI numbering
throughout
* Reduce the number of instruction kinds by merging integer subtracts
and shifts into PPC32Instr_Alu32. Use rlwinm to do Shl/Shr by
immediate.
* Create a copy of PPC32RI (reg-or-imm) called PPC32RH
(reg-or-halfword-imm), and give the latter a flag indicating whether
the imm is regarded as signed or not. Use PPC32RH in most places
where PPC32RI was used before.
* Add instruction selection functions to compute a value into a
PPC32RI, a PPC32RH of specified signedness, and a PPC32RH variant in
which the immediate is unsigned and in the range 1 .. 31 inclusive
(used for shifts-by-immediate).
* Simplify PPC32Instr_MulL; all 3 operands are now simply registers.
* Add a new (fake) insn PPC32Instr_LI32 to get arbitrary 32-bit
immediates into int registers; this hides all the ugly li vs lis/ori
details (see the sketch below).
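As an illustration of what PPC32Instr_LI32 hides, here is a hedged C
sketch of the li vs lis/ori decision; the emit_* helpers are
hypothetical stand-ins for the real emitter:

    #include <stdint.h>

    static void emit_li32(int rD, uint32_t imm,
                          void (*emit_li) (int rD, int16_t simm),
                          void (*emit_lis)(int rD, uint16_t hi16),
                          void (*emit_ori)(int rD, int rS, uint16_t lo16))
    {
        int32_t simm = (int32_t)imm;
        if (simm >= -32768 && simm <= 32767) {
            /* fits in a signed 16-bit immediate: one insn (li) */
            emit_li(rD, (int16_t)simm);
        } else {
            /* general case: load the high half, then OR in the low half */
            emit_lis(rD, (uint16_t)(imm >> 16));
            emit_ori(rD, rD, (uint16_t)(imm & 0xFFFF));
        }
    }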
Tidy up some loose ends in the self-checking-translations machinery,
and unroll the adler32 loop in a not-very-successful attempt to reduce
the overhead of checking.
Basic support for self-checking translations. It fits quite neatly
into the IR: if a translation self-check fails, the translation exits
passing VEX_TRC_JMP_TINVAL to the despatcher and with the
guest_TISTART/guest_TILEN pseudo-registers indicating what area of the
guest code needs to be invalidated. The actual checksumming is done
by a helper function which does (a variant of) the Adler32 checksum.
Space/time overhead, whilst substantial, looks tolerable. There's a
little room for optimisation of the basic scheme. It would certainly
be viable to run with self-checking for all translations to support
Valgrinding JITs (including V itself) without any assistance from the
JIT.
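For reference, a straightforward C rendering of standard Adler-32; the
in-tree helper computes a variant of this, so details such as exactly
when the modulus is applied are illustrative only:

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t adler32(const uint8_t* buf, size_t len)
    {
        const uint32_t MOD = 65521;   /* largest prime below 2^16 */
        uint32_t a = 1, b = 0;
        for (size_t i = 0; i < len; i++) {
            a = (a + buf[i]) % MOD;   /* running byte sum    */
            b = (b + a) % MOD;        /* sum of running sums */
        }
        return (b << 16) | a;
    }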
A further hack to reduce ppc32 reg-alloc costs: don't give the
regalloc so many registers to play with. In the majority of cases it
won't be able to make much use of vast hordes of FP and Altivec
registers anyway.
Fix (well, ameliorate, at least) some lurking performance problems
(time taken to do register allocation, not quality of result) which
were tolerable when allocating for x86/amd64 but got bad when dealing
with ppc-ish numbers of real registers (90-ish).
* Don't sanity-check the entire regalloc state after each insn
processed; this is total overkill. Instead do it every 7th insn
processed (somewhat arbitrarily) and just before the last insn.
* Reinstate an optimisation from the old UCode allocator: shadow
the primary state structure (rreg_state) with a redundant inverse
mapping (vreg_state) to remove the need to search through
rreg_state when looking for info about a given vreg, a very common
operation. Add logic to keep the two maps consistent, and a sanity
check to ensure they really are consistent (see the sketch after
this list).
* Rename some variables and macros to make the code easier to
understand.
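A hedged sketch of the two-map idea; the array names follow the text
above, but the layout is illustrative, not the real allocator's:

    #define N_RREGS 96      /* ppc-ish number of real registers */
    #define N_VREGS 1024
    #define INVALID (-1)

    static int rreg_state[N_RREGS];  /* rreg -> vreg bound to it, or INVALID */
    static int vreg_state[N_VREGS];  /* vreg -> index in rreg_state, or INVALID */

    static void bind(int r, int v)
    {
        rreg_state[r] = v;
        vreg_state[v] = r;           /* keep the inverse map in step */
    }

    /* the consistency check: the maps must be mutual inverses */
    static int maps_consistent(void)
    {
        for (int v = 0; v < N_VREGS; v++)
            if (vreg_state[v] != INVALID && rreg_state[vreg_state[v]] != v)
                return 0;
        return 1;
    }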
On x86->x86 (--tool=none), total Vex run time is reduced by about 10%,
and amd64 is similar. For ppc32 the Vex run time is nearly halved. On
x86->x86 (--tool=none), register allocation now consumes only about
10% of the total Vex run time.
When hooked up to Valgrind, run time of short-running programs --
which is dominated by translation time -- is reduced by up to 10%.
Calltree/KCachegrind/Cachegrind proved instrumental in tracking down
and quantifying these performance problems. Thanks, Josef & Nick.
The logic that drove basic-block-to-IR disassembly had been duplicated
across the 3 front ends (x86, amd64, ppc32). Given the need to take
into account basic-block chasing, the adding of instruction marks,
etc., the logic is not completely straightforward, and so commoning it
up is a good thing.
Julian Seward [Thu, 30 Jun 2005 23:31:27 +0000 (23:31 +0000)]
Enhance IR so as to distinguish between little- and big-endian loads and
stores, so that PPC can be properly handled. Until now it's been hardwired
to assume little-endian.
As a result, IRStmt_STle is renamed IRStmt_Store and IRExpr_LDle is
renamed IRExpr_Load.
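To make the distinction concrete, a self-contained C sketch of what an
endianness-tagged 32-bit load must compute; the IREndness names are
assumptions of this sketch, and load32 is a hypothetical helper:

    #include <stdint.h>

    typedef enum { Iend_LE, Iend_BE } IREndness;

    static uint32_t load32(IREndness end, const uint8_t* p)
    {
        if (end == Iend_LE)    /* x86/amd64 native order */
            return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8)
                 | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
        else                   /* Iend_BE: PPC native order */
            return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
                 | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }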
Julian Seward [Thu, 30 Jun 2005 12:08:48 +0000 (12:08 +0000)]
Connect up the plumbing which allows the ppc32 front end to know the
cache line size it is supposed to simulate. Use this in
dis_cache_manage(). Finally reinstate 'dcbz'.
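For the curious, a hedged sketch of how dcbz can be modelled once the
line size is known (model_dcbz and store8 are hypothetical names):
dcbz zeroes the whole line-aligned, line-sized block containing the
given address.

    #include <stdint.h>

    static void model_dcbz(uint32_t addr, uint32_t lineszB,
                           void (*store8)(uint32_t ea, uint8_t v))
    {
        /* lineszB is assumed to be a power of two */
        uint32_t base = addr & ~(lineszB - 1);  /* align down to the line */
        for (uint32_t i = 0; i < lineszB; i++)
            store8(base + i, 0);
    }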
Julian Seward [Thu, 30 Jun 2005 11:49:14 +0000 (11:49 +0000)]
(API-visible change): generalise the VexSubArch idea. Everywhere
where a VexSubArch was previously passed around, a VexArchInfo is now
passed around. This is a struct which carries more details about any
given architecture and in particular gives a clean way to pass around
info about PPC cache line sizes, which is needed for guest-side PPC.
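A hedged sketch of the shape of the change; the field names here are
assumptions, not necessarily those in the tree:

    typedef enum { VexSubArch_INVALID = 0 /* , ... */ } VexSubArch;

    typedef struct {
        VexSubArch subarch;             /* what was passed around before */
        int        ppc_cache_line_szB;  /* new: e.g. 32 or 128 on PPC    */
    } VexArchInfo;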
guest-ppc32
~~~~~~~~~~
- store-with-update instrs: Valgrind's pagefault handler expects the
faulting address to be >= the current stack ptr, so we need to update
the stack ptr register _before_ storing the old stack ptr (see the
sketch after this list)
- branch_ctr_ok (bad calc for 'branch if %ctr zero' case)
- mcrf: scanning bitfields in the wrong direction
- on spotting the magic sequence, delta += 24
- updated DIPs for +ve-only args
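A self-contained C sketch of the store-with-update ordering fix, for
something like 'stwu r1,-N(r1)'; the names here are hypothetical:

    #include <stdint.h>

    typedef struct { uint32_t gpr[32]; } GuestState;

    static void do_stwu(GuestState* gst, int rS, int rA, int32_t d,
                        void (*store32)(uint32_t ea, uint32_t v))
    {
        uint32_t ea  = gst->gpr[rA] + d;  /* effective address */
        uint32_t val = gst->gpr[rS];      /* old stack ptr when rS == rA == 1 */
        gst->gpr[rA] = ea;                /* update the stack ptr FIRST ...   */
        store32(ea, val);                 /* ... so a fault here sees the new,
                                             lower stack ptr */
    }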
host-ppc32
~~~~~~~~~
- fixed CMov reg usage
- fixed Pin_Call in emit_PPC32Instr(): we already know how far we're jumping
- fixed Pin_Goto in emit_PPC32Instr(): vassert right range of jump deltas
other-ppc32
~~~~~~~~~~
- exported OFFSET_ppc32_(various) for valgrind
Julian Seward [Wed, 18 May 2005 11:47:47 +0000 (11:47 +0000)]
Handle XCHG rAX, reg for 32-bit regs as well as 64-bit regs. I'm not
sure this is right -- the AMD64 docs are very difficult to interpret
on the subtle point of precisely what is and isn't to be regarded as a
no-op.
Julian Seward [Thu, 12 May 2005 17:55:01 +0000 (17:55 +0000)]
Add the beginnings of what might be a general mechanism to pass
ABI-specific knowledge through the IR compilation pipeline. This
entails a new IR construction, AbiHint.
Currently there is only one kind of hint, and it is generated by the
amd64 front end. This tells whoever wants to know that a function
call or return has happened, and so the 128 bytes below %rsp should be
considered undefined.
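A self-contained sketch of what a tool might do with the hint;
apply_abi_hint and mark_undefined are hypothetical names:

    #include <stdint.h>

    /* After a call or return on amd64, the 128 bytes below the new
       %rsp can be treated as undefined. */
    static void apply_abi_hint(uint64_t rsp,
                               void (*mark_undefined)(uint64_t base,
                                                      uint64_t len))
    {
        mark_undefined(rsp - 128, 128);   /* [rsp-128, rsp) is now junk */
    }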
Julian Seward [Wed, 11 May 2005 23:16:13 +0000 (23:16 +0000)]
Allow reg-alloc to use %rbx. This is a callee-saved register and
therefore particularly valuable - bringing it into circulation reduces
the volume of code generated by memcheck by about 3%.
Julian Seward [Wed, 11 May 2005 22:55:08 +0000 (22:55 +0000)]
Ah, the joys of register allocation. You might think that giving
reg-alloc as many registers as possible maximises performance. You
would be wrong. Giving it more registers generates more spilling of
caller-saved regs around the innumerable helper calls created by
Memcheck. What we really need are zillions of callee-save registers,
but those are in short supply. Hmm, perhaps I should let it use %rbx
too -- that's listed as callee-save.
Anyway, the current arrangement allows reg-alloc to use 8
general-purpose regs and 10 xmm registers. The x87 registers are not
used at all. This seems to work fairly well.