Modify this test so it no longer uses client requests, but instead
relies on --smc-support=all to work correctly. Hence it tests the
s-m-c support at least on x86. Jump through various hoops to defeat
vex's basic-block-chasing optimisation, which has an annoying habit of
making this test work correctly even without --smc-support=all.
Support for self-modifying code on unfriendly platforms (x86, amd64)
via the use of self-checking translations. (Friendly platforms which
have icache-invalidation instructions we can observe, such as ppc32,
are already handled correctly.) This should finally fix the
longstanding problem of V incorrectly handling calls of statically
nested functions (a gcc extension), and more generally make it a lot
easier to use V to debug dynamic code generation systems.
Since self-checking is a large performance overhead, there is some
control via a command line flag:
  --smc-support=none
     Don't make any translations self-checking.
  --smc-support=stack
     Add checking code for translations taken from segments which
     have the SF_GROWDOWN flag set -- stacks, basically.
     This is the default. It should make gcc nested functions and
     GNU Ada work correctly with no intervention from the user.
  --smc-support=all
     Make all translations self-checking. This is expensive, so you
     probably only want to do this if you're debugging a JIT compiler
     or some such.
Basic support for self-checking translations. It fits quite neatly
into the IR: if a translation self-check fails, the translation exits
passing VEX_TRC_JMP_TINVAL to the despatcher and with the
guest_TISTART/guest_TILEN pseudo-registers indicating what area of the
guest code needs to be invalidated. The actual checksumming is done
by a helper function which does (a variant of) the Adler32 checksum.
Space/time overhead, whilst substantial, looks tolerable. There's a
little room for optimisation of the basic scheme. It would certainly
be viable to run with self-checking for all translations to support
Valgrinding JITs (including V itself) without any assistance from the
JIT.
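For illustration, a minimal sketch of the checksum-and-compare idea. This is the standard Adler-32, not the variant the helper actually uses (the log does not describe the variant), and the function names, parameters and the way the expected checksum is stored are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard Adler-32 over [buf, buf+len).  Illustrative only: the real
   helper computes a variant whose details are not given in the log. */
static uint32_t adler32(const uint8_t *buf, size_t len)
{
    const uint32_t MOD = 65521u;   /* largest prime below 2^16 */
    uint32_t a = 1, b = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % MOD;
        b = (b + a) % MOD;
    }
    return (b << 16) | a;
}

/* Hypothetical self-check: nonzero if the guest bytes still match the
   checksum recorded when the translation was made.  A zero result
   corresponds to the case where the translation must be invalidated. */
static int translation_still_valid(const uint8_t *guest_code, size_t len,
                                   uint32_t expected)
{
    return adler32(guest_code, len) == expected;
}
```

The per-byte modulo keeps the sketch obviously correct; a faster version would defer the reduction across many bytes.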
Scan the entire BB looking for "bogus literals" before instrumenting
any of it, so as to avoid any problems arising from switching from one
scheme to the other half-way through.
Extensively re-analyse, re-check and revise the scheme for expensive
handling of integer EQ/NE, which can sometimes do better than the
naive scheme when the inputs are partially defined. I never was
convinced it was correct before, but now I am. Regtest to follow.
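A rough sketch of why the expensive scheme can do better, under the assumption that a shadow bit of 1 means "undefined". It shows only the general idea (inputs that differ at a bit defined in both can never be equal, so the result is defined); it is not claimed to be the exact formulation used in the code, and the function names are hypothetical.

```c
#include <stdint.h>

/* Shadow ("V") bits: a 1 bit means the matching value bit is undefined. */

/* Naive scheme: the EQ/NE result is undefined if any input bit is. */
static int cmp_result_undefined_naive(uint32_t va, uint32_t vb)
{
    return (va | vb) != 0;
}

/* Expensive scheme: if the inputs differ at some bit position that is
   defined in both, they cannot be equal whatever the undefined bits
   hold, so the result is defined.  Otherwise fall back to the naive
   answer. */
static int cmp_result_undefined_expensive(uint32_t a, uint32_t va,
                                          uint32_t b, uint32_t vb)
{
    uint32_t undef = va | vb;
    uint32_t defined_and_different = (a ^ b) & ~undef;
    if (defined_and_different != 0)
        return 0;              /* definitely unequal: result is defined */
    return undef != 0;
}
```

For example, with a = 0x10 (low nibble undefined) and b = 0x20 (fully defined), the naive scheme reports an undefined result, whereas the expensive one sees that bits 4 and 5 are defined and differ.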
Tom Hughes [Tue, 5 Jul 2005 23:25:17 +0000 (23:25 +0000)]
Sort out the mess that is pread64/pwrite64 properly. All three platforms
that we currently support use the same handlers in the kernel without any
platform specific wrappers.
The final argument is a 64 bit argument however, which means that it
requires two registers on x86 and ppc32 and only one on amd64. The
reason it works in the kernel is that x86 and ppc32 calling conventions
inside the kernel work out correctly and the values get joined together.
For our purposes we make x86 and ppc32 use the generic veneer with
five arguments and amd64 use a platform specific one with four...
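As a sketch of the "joined together" step on the 32-bit platforms: the 64-bit offset is reassembled from two register-sized halves. The function and parameter names are made up for the example, and which register carries which half is platform-dependent and not stated here.

```c
#include <stdint.h>

/* Join two 32-bit register values into one 64-bit file offset.
   Hypothetical helper: 'lo' and 'hi' are assumed to be the low and
   high halves; the actual register order depends on the platform. */
static uint64_t join_offset(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}
```

On amd64 no such step is needed, since the whole offset fits in one register.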
Disable PIE by default (sorry Tom), even on PIE-enabled platforms. It
causes too much breakage. PIE builds are still possible, but you have
to say --enable-pie to get them now.
Add a test script (recycled version of Tom's nightly/bin/nightly)
which is useful for doing automated test runs against the GNU
Scientific Library v 1.6 (gsl-1.6). This has proven very helpful in
shaking out Vex simulation bugs.
A further hack to reduce ppc32 reg-alloc costs: don't give the
regalloc so many registers to play with. In the majority of cases it
won't be able to make much use of vast hordes of FP and Altivec
registers anyway.
Fix (well, ameliorate, at least) some lurking performance problems
(time taken to do register allocation, not quality of result) which
were tolerable when allocating for x86/amd64 but got bad when dealing
with ppc-ish numbers of real registers (90 ish).
* Don't sanity-check the entire regalloc state after each insn
  processed; this is total overkill. Instead do it every 7th insn
  processed (somewhat arbitrarily) and just before the last insn.
* Reinstate an optimisation from the old UCode allocator: shadow
  the primary state structure (rreg_state) with a redundant inverse
  mapping (vreg_state) to remove the need to search
  through rreg_state when looking for info about a given vreg, a
  very common operation. Add logic to keep the two maps consistent.
  Add a sanity check to ensure they really are consistent.
* Rename some variables and macros to make the code easier to
  understand.
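The inverse-mapping idea from the list above can be sketched as follows. This is a simplified model, not the allocator's real data structures: only the names rreg_state and vreg_state come from the log, while the plain int arrays, the sizes, UNBOUND and all function names are assumptions. The periodic-check helper is one possible reading of "every 7th insn and just before the last".

```c
#include <assert.h>

#define N_RREGS 8        /* hypothetical sizes */
#define N_VREGS 16
#define UNBOUND (-1)

static int rreg_state[N_RREGS];  /* rreg -> vreg it holds, or UNBOUND */
static int vreg_state[N_VREGS];  /* vreg -> rreg holding it, or UNBOUND */

static void init_maps(void)
{
    for (int r = 0; r < N_RREGS; r++) rreg_state[r] = UNBOUND;
    for (int v = 0; v < N_VREGS; v++) vreg_state[v] = UNBOUND;
}

/* Every update touches both maps, so they cannot drift apart. */
static void bind(int rreg, int vreg)
{
    assert(rreg_state[rreg] == UNBOUND && vreg_state[vreg] == UNBOUND);
    rreg_state[rreg] = vreg;
    vreg_state[vreg] = rreg;
}

static void unbind(int rreg)
{
    int vreg = rreg_state[rreg];
    if (vreg != UNBOUND) vreg_state[vreg] = UNBOUND;
    rreg_state[rreg] = UNBOUND;
}

/* O(1) lookup instead of a linear search through rreg_state. */
static int find_rreg_for(int vreg)
{
    return vreg_state[vreg];
}

/* Check that the two maps are exact inverses of each other. */
static void sanity_check(void)
{
    for (int r = 0; r < N_RREGS; r++)
        if (rreg_state[r] != UNBOUND)
            assert(vreg_state[rreg_state[r]] == r);
    for (int v = 0; v < N_VREGS; v++)
        if (vreg_state[v] != UNBOUND)
            assert(rreg_state[vreg_state[v]] == v);
}

/* Run the full check only on every 7th insn and on the last one. */
static int should_sanity_check(int insn_index, int n_insns)
{
    return (insn_index % 7) == 0 || insn_index == n_insns - 1;
}
```

The cost of the redundancy is that every bind and unbind writes twice; the benefit is that the common "where is this vreg?" query no longer scans all real registers, which matters most with ppc-sized register files.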
On x86->x86 (--tool=none), total Vex run time is reduced by about 10%,
and amd64 is similar. For ppc32 the Vex run time is nearly halved. On
x86->x86 (--tool=none), register allocation now consumes only about
10% of the total Vex run time.
When hooked up to Valgrind, run time of short-running programs --
which is dominated by translation time -- is reduced by up to 10%.
Calltree/kcachegrind/cachegrind proved instrumental in tracking down
and quantifying these performance problems. Thanks, Josef & Nick.
Changed m_hashtable.c to allow the size of the hash table to be specified
when it is created. Fortunately this didn't affect code outside this
module except for the calls to VG_(HT_construct)().
As a result, we save some memory because not all tables have to be as big
as the ones needed for malloc/free tracking.
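A minimal sketch of a chained hash table whose bucket count is chosen at construction time. This is not the m_hashtable.c interface: every type and function name below is hypothetical, and the real signature of VG_(HT_construct)() is not shown in the log.

```c
#include <stdlib.h>

typedef struct Node {
    struct Node  *next;
    unsigned long key;
} Node;

typedef struct {
    size_t n_buckets;    /* fixed when the table is created */
    Node **buckets;
} HashTable;

/* The caller picks the size, so small tables need not pay for the
   bucket array a malloc/free-tracking table would want. */
static HashTable *ht_construct(size_t n_buckets)
{
    if (n_buckets == 0) return NULL;
    HashTable *t = malloc(sizeof *t);
    if (t == NULL) return NULL;
    t->buckets = calloc(n_buckets, sizeof *t->buckets);
    if (t->buckets == NULL) { free(t); return NULL; }
    t->n_buckets = n_buckets;
    return t;
}

static void ht_add(HashTable *t, Node *n)
{
    size_t i = n->key % t->n_buckets;
    n->next = t->buckets[i];
    t->buckets[i] = n;
}

static Node *ht_lookup(const HashTable *t, unsigned long key)
{
    for (Node *n = t->buckets[key % t->n_buckets]; n != NULL; n = n->next)
        if (n->key == key) return n;
    return NULL;
}

/* Frees the table only; nodes are owned by the caller. */
static void ht_destruct(HashTable *t)
{
    free(t->buckets);
    free(t);
}
```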
The logic that drove basic block to IR disassembly had been duplicated
over the 3 front ends (x86, amd64, ppc32). Given the need to take
into account basic block chasing, adding of instruction marks, etc,
the logic is not completely straightforward, and so commoning it up is
a good thing.
Fixed 'make dist'. In particular, all the arch/platform-specific files
get included in the distro now, not just the ones for the arch/platform
that the distro tarball is built on.
Try to make (client) clone() work for ppc32-linux. I don't know if I
was successful for real uses of clone, but fork-disguised-as-clone
appears to work now.
Julian Seward [Thu, 30 Jun 2005 23:31:27 +0000 (23:31 +0000)]
Enhance IR so as to distinguish between little- and big-endian loads and
stores, so that PPC can be properly handled. Until now it's been hardwired
to assume little-endian.
As a result, IRStmt_STle is renamed IRStmt_Store and IRExpr_LDle is
renamed IRExpr_Load.
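To illustrate what an endianness-tagged load means, here is a small sketch that reads a 32-bit value from memory in either byte order. The enum and function names are invented for the example and are not the IR's own constructors.

```c
#include <stdint.h>

typedef enum { END_LE, END_BE } Endness;   /* hypothetical names */

/* Load 32 bits from p, interpreting the bytes in the given order. */
static uint32_t load32(const uint8_t *p, Endness e)
{
    if (e == END_LE)
        return (uint32_t)p[0]
             | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16)
             | ((uint32_t)p[3] << 24);
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         |  (uint32_t)p[3];
}
```

The same four bytes 12 34 56 78 yield 0x78563412 when loaded little-endian and 0x12345678 when loaded big-endian, which is why the load and store have to carry the byte order explicitly once PPC is a guest.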
Julian Seward [Thu, 30 Jun 2005 12:08:48 +0000 (12:08 +0000)]
Connect up the plumbing which allows the ppc32 front end to know the
cache line size it is supposed to simulate. Use this in
dis_cache_manage(). Finally reinstate 'dcbz'.
Julian Seward [Thu, 30 Jun 2005 11:49:14 +0000 (11:49 +0000)]
(API-visible change): generalise the VexSubArch idea. Everywhere
where a VexSubArch was previously passed around, a VexArchInfo is now
passed around. This is a struct which carries more details about any
given architecture and in particular gives a clean way to pass around
info about PPC cache line sizes, which is needed for guest-side PPC.