Julian Seward [Sun, 21 Aug 2005 00:48:37 +0000 (00:48 +0000)]
On a PPC32Instr_Call, don't merely record how many integer registers
carry parameters. Instead record the actual identities of such
registers in a bitmask. This is necessary because the PPC calling
conventions have "holes" in the register ranges. For example, a
routine taking a UInt (32-bit) first param and a ULong (64-bit) second
param passes the first arg in r3 but the second one in r5 and r6; r4
is not used.
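The "hole" behaviour described above can be sketched as follows. This is an illustrative model, not the actual VEX code: it assumes the PPC32 SVR4 rule that integer args go in r3..r10 and that a 64-bit arg must start in an odd-numbered register, skipping one register if necessary.

```c
/* Hypothetical sketch: compute a bitmask of the integer argument
   registers r3..r10 used by a call.  64-bit args must start in an
   odd-numbered register, which is what creates the "holes". */
unsigned arg_reg_mask(const int *arg_sizes, int n_args)
{
   unsigned mask = 0;
   int reg = 3;                      /* first int arg register is r3 */
   for (int i = 0; i < n_args; i++) {
      if (arg_sizes[i] == 8) {
         if (reg % 2 == 0) reg++;   /* align to odd reg: this skip is the hole */
         if (reg > 9) break;        /* r9:r10 is the last available pair */
         mask |= (1u << reg) | (1u << (reg + 1));
         reg += 2;
      } else {
         if (reg > 10) break;       /* remaining args go on the stack */
         mask |= 1u << reg;
         reg++;
      }
   }
   return mask;
}
```

For the UInt/ULong example in the text, this yields bits for r3, r5 and r6 but not r4.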
Julian Seward [Thu, 18 Aug 2005 11:50:43 +0000 (11:50 +0000)]
Add tested but unused code just in case it is useful at some point in
the future: a potentially more memcheck-friendly implementation of
count-leading-zeroes.
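The commit doesn't show the code, but a memcheck-friendlier count-leading-zeroes is plausibly branch-free and operates on full words, so definedness tracking never has to follow a data-dependent loop. A sketch in that spirit (my reconstruction, not the committed code):

```c
/* Branch-free count-leading-zeroes: smear the topmost set bit
   rightwards, then population-count the complement. */
unsigned clz32(unsigned x)
{
   /* Smear the highest 1 bit into all lower positions. */
   x |= x >> 1;
   x |= x >> 2;
   x |= x >> 4;
   x |= x >> 8;
   x |= x >> 16;
   /* ~x now has a 1 exactly in the original leading-zero positions. */
   x = ~x;
   /* Classic SWAR popcount. */
   x = x - ((x >> 1) & 0x55555555u);
   x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
   x = (x + (x >> 4)) & 0x0F0F0F0Fu;
   return (x * 0x01010101u) >> 24;
}
```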
Never ever delete vex_svnversion.h except when doing 'make version'.
The purpose is that 'make distclean' or 'make clean' in a tarball'd
build does not delete it, and so does not render the tree unbuildable.
Don't delete vex_svnversion.h during 'make clean'. This causes
breakage if someone builds from the final V tarball, then does 'make
clean', then re-runs make -- because creating this file requires (1)
svnversion to be present on the end-user system, which it probably
isn't, and (2) the metadata which svnversion consults also to be
present here, which it certainly isn't [in the cut-down VEX image in
the distro tarball.]
Fix very stupid bug in my mtxer implementation. The relevant IRStmts
would work better (viz, at all :-) if they were added to the IR code
list after being created, instead of merely being dropped down the
back of the fridge.
Some changes to the ppc32 compilation pipeline, with two aims:
* to achieve code quality comparable with x86/amd64 pipelines
* to make the value flow clearer to memcheck, in the hope of
reducing the very high false error rate it gives on ppc
Code quality is substantially improved, but the error rate is just as
high as it was before. Needs investigation.
Many instructions are now commented out -- mostly they just need
commenting back in. Simple integer programs (date, ls, xfontsel)
work.
Front end changes
~~~~~~~~~~~~~~~~~
Change the way CR and XER are represented, and hence redo the way
integer comparisons and conditional branches work:
* Introduce two new IR primops, CmpORD32S and CmpORD32U; these do
ppc-style 3-way comparisons (<, >, ==). It's hard to simulate ppc
efficiently without them. Use these to implement integer compares.
* Get rid of all thunks for condition codes -- CR and XER state
is always up to date now.
* Split XER into four fields and CR into 16 fields, so that
their various components can be accessed directly without
endless shifting and masking. Created suitable impedance
matching functions to read/write XER and CR as a whole.
* Use hardware BI numbering throughout.
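The front-end changes above can be summarised by a reference model of the two new primops. This sketch assumes the PPC CR field encoding (LT=8, GT=4, EQ=2) as the result of the 3-way comparison; it is illustrative, not the VEX definition itself:

```c
#include <stdint.h>

/* Reference semantics of the two ORD primops, assuming the PPC
   CR-field bit encoding: LT=8, GT=4, EQ=2. */
uint32_t cmpORD32S(int32_t x, int32_t y)     /* signed compare */
{
   return x < y ? 8 : x > y ? 4 : 2;
}

uint32_t cmpORD32U(uint32_t x, uint32_t y)   /* unsigned compare */
{
   return x < y ? 8 : x > y ? 4 : 2;
}
```

Having the primop produce the CR-style encoding directly is what makes ppc-style compares cheap to simulate: no reconstruction of three separate flag bits is needed.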
Back end changes
~~~~~~~~~~~~~~~~
* Simplify condition code handling and use hardware BI numbering
throughout
* Reduce the number of instruction kinds by merging integer subtracts
and shifts into PPC32Instr_Alu32. Use rlwimi to do Shl/Shr by
immediate.
* Create a copy of PPC32RI (reg-or-imm) called PPC32RH
(reg-or-halfword-imm), and give the latter a flag indicating whether
the imm is regarded as signed or not. Use PPC32RH in most places
where PPC32RI was used before.
* Add instruction selection functions to compute a value into a
PPC32RI, a PPC32RH of specified signedness, and a PPC32RH variant in
which the immediate is unsigned and in the range 1 .. 31 inclusive
(used for shifts-by-immediate).
* Simplify PPC32Instr_MulL; all 3 operands are now simply registers.
* Add a new (fake) insn PPC32Instr_LI32 to get arbitrary 32-bit
immediates into int registers; this hides all the ugly li vs lis/ori
details.
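The li vs lis/ori detail that LI32 hides can be sketched like this (an illustration of the decision, not the actual emitter; `li32_cost` is a made-up name):

```c
#include <stdint.h>

/* Sketch of the choice LI32 hides: a constant whose value fits a
   sign-extended 16-bit immediate needs a single "li"; a constant
   with a zero low halfword needs a single "lis"; everything else
   needs "lis" followed by "ori".  Returns the real-insn count. */
int li32_cost(uint32_t imm)
{
   int32_t simm = (int32_t)imm;
   if (simm >= -32768 && simm <= 32767)
      return 1;                  /* li rD, imm */
   if ((imm & 0xFFFF) == 0)
      return 1;                  /* lis rD, imm>>16 */
   return 2;                     /* lis rD, imm>>16 ; ori rD, rD, imm&0xFFFF */
}
```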
Tidy up some loose ends in the self-checking-translations machinery,
and unroll the adler32 loop in a not-very-successful attempt to reduce
the overhead of checking.
Basic support for self-checking translations. It fits quite neatly
into the IR: if a translation self-check fails, the translation exits
passing VEX_TRC_JMP_TINVAL to the despatcher and with the
guest_TISTART/guest_TILEN pseudo-registers indicating what area of the
guest code needs to be invalidated. The actual checksumming is done
by a helper function which does (a variant of) the Adler32 checksum.
Space/time overhead, whilst substantial, looks tolerable. There's a
little room for optimisation of the basic scheme. It would certainly
be viable to run with self-checking for all translations to support
Valgrinding JITs (including V itself) without any assistance from the
JIT.
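For reference, plain Adler-32 looks like this; the commit says the helper uses "a variant of" it, and the exact variant isn't spelled out here:

```c
#include <stdint.h>
#include <stddef.h>

/* Standard Adler-32 (RFC 1950): two running sums mod 65521. */
uint32_t adler32(const uint8_t *buf, size_t len)
{
   uint32_t a = 1, b = 0;
   for (size_t i = 0; i < len; i++) {
      a = (a + buf[i]) % 65521;
      b = (b + a) % 65521;
   }
   return (b << 16) | a;
}
```

A checksum like this suits the self-check scheme because it is cheap, needs no tables, and a mismatch against the value recorded at translation time is enough to trigger the VEX_TRC_JMP_TINVAL exit.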
A further hack to reduce ppc32 reg-alloc costs: don't give the
regalloc so many registers to play with. In the majority of cases it
won't be able to make much use of vast hordes of FP and Altivec
registers anyway.
Fix (well, ameliorate, at least) some lurking performance problems
(time taken to do register allocation, not quality of result) which
were tolerable when allocating for x86/amd64 but got bad when dealing
with ppc-ish numbers of real registers (90-ish).
* Don't sanity-check the entire regalloc state after each insn
processed; this is total overkill. Instead do it every 7th insn
processed (somewhat arbitrarily) and just before the last insn.
* Reinstate an optimisation from the old UCode allocator: shadow
the primary state structure (rreg_state) with a redundant inverse
mapping (vreg_state) to remove the need to search
through rreg_state when looking for info about a given vreg, a
very common operation. Add logic to keep the two maps consistent.
Add a sanity check to ensure they really are consistent.
* Rename some variables and macros to make the code easier to
understand.
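The primary/inverse-map idea can be sketched as follows. Names and sizes here are illustrative (the real rreg_state/vreg_state entries carry more than a single index); the point is that the two maps are updated together and checked as mutual inverses:

```c
#include <assert.h>

enum { N_RREGS = 32, N_VREGS = 256 };
static int rreg_state[N_RREGS];   /* rreg -> vreg it holds, or -1 */
static int vreg_state[N_VREGS];   /* vreg -> rreg holding it, or -1 (inverse) */

void maps_init(void)
{
   for (int r = 0; r < N_RREGS; r++) rreg_state[r] = -1;
   for (int v = 0; v < N_VREGS; v++) vreg_state[v] = -1;
}

/* Assign vreg to rreg, keeping both maps consistent, so that lookups
   by vreg are O(1) instead of a search through rreg_state. */
void bind(int rreg, int vreg)
{
   rreg_state[rreg] = vreg;
   vreg_state[vreg] = rreg;
}

/* The sanity check mentioned above: the maps must be mutual inverses. */
void maps_check(void)
{
   for (int r = 0; r < N_RREGS; r++)
      if (rreg_state[r] != -1)
         assert(vreg_state[rreg_state[r]] == r);
   for (int v = 0; v < N_VREGS; v++)
      if (vreg_state[v] != -1)
         assert(rreg_state[vreg_state[v]] == v);
}
```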
On x86->x86 (--tool=none), total Vex runtime is reduced by about 10%,
and amd64 is similar. For ppc32 the Vex runtime is nearly halved. On
x86->x86 (--tool=none), register allocation now consumes only about
10% of the total Vex run time.
When hooked up to Valgrind, run time of short-running programs --
which is dominated by translation time -- is reduced by up to 10%.
Calltree/kcachegrind/cachegrind proved instrumental in tracking down
and quantifying these performance problems. Thanks, Josef & Nick.
The logic that drove basic block to IR disassembly had been duplicated
over the 3 front ends (x86, amd64, ppc32). Given the need to take
into account basic block chasing, adding of instruction marks, etc,
the logic is not completely straightforward, and so commoning it up is
a good thing.
Julian Seward [Thu, 30 Jun 2005 23:31:27 +0000 (23:31 +0000)]
Enhance IR so as to distinguish between little- and big-endian loads and
stores, so that PPC can be properly handled. Until now it's been hardwired
to assume little-endian.
As a result, IRStmt_STle is renamed IRStmt_Store and IRExpr_LDle is
renamed IRExpr_Load.
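Operationally, the change means endianness becomes an explicit parameter of loads and stores rather than being baked into the constructor name. A small illustrative model (the enum follows the commit's direction; the helper is made up for illustration):

```c
#include <stdint.h>

typedef enum { Iend_LE, Iend_BE } IREndness;

/* What an endianness-parameterised 32-bit store means: the same data
   value lays down its bytes in opposite orders. */
void store32(uint8_t *mem, IREndness end, uint32_t data)
{
   for (int i = 0; i < 4; i++) {
      int shift = (end == Iend_LE) ? 8 * i : 8 * (3 - i);
      mem[i] = (data >> shift) & 0xFF;
   }
}
```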
Julian Seward [Thu, 30 Jun 2005 12:08:48 +0000 (12:08 +0000)]
Connect up the plumbing which allows the ppc32 front end to know the
cache line size it is supposed to simulate. Use this in
dis_cache_manage(). Finally reinstate 'dcbz'.
Julian Seward [Thu, 30 Jun 2005 11:49:14 +0000 (11:49 +0000)]
(API-visible change): generalise the VexSubArch idea. Everywhere
where a VexSubArch was previously passed around, a VexArchInfo is now
passed around. This is a struct which carries more details about any
given architecture and in particular gives a clean way to pass around
info about PPC cache line sizes, which is needed for guest-side PPC.
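The shape of the generalisation is roughly the following; field and enum names here are assumptions for illustration, not the exact VEX declarations:

```c
/* Instead of passing a bare VexSubArch enum, callers pass a struct
   that can carry extra per-architecture details, such as the guest
   cache line size needed by dis_cache_manage() / dcbz. */
typedef enum {
   VexSubArch_INVALID,
   VexSubArchPPC32_noAV,     /* illustrative values */
   VexSubArchPPC32_AV
} VexSubArch;

typedef struct {
   VexSubArch subarch;
   int        ppc_cache_line_szB;   /* guest cache line size, bytes */
} VexArchInfo;
```

Packaging the sub-architecture into a struct means future per-arch parameters can be added without another API-visible change to every call site.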