Julian Seward [Sun, 21 Feb 2010 20:40:53 +0000 (20:40 +0000)]
CVTPI2PD (which converts 2 x I32 in M64 or MMX to 2 x F64 in XMM):
only switch the x87 FPU to MMX mode in the case where the source
operand is in memory, not in an MMX register. This fixes #210264.
This is all very fishy.
* it's inconsistent with all other instructions which convert between
values in (MMX or M64) and XMM, in that they put the FPU in MMX mode
even if the source is memory, not an MMX register (for example,
CVTPI2PS). At least, that's what the Intel docs appear to say.
* the AMD documentation makes no mention at all of this. For example
it makes no differentiation in this matter between CVTPI2PD and
CVTPI2PS.
I wonder whether Intel surreptitiously changed the behaviour of
CVTPI2PD since this code was written circa 5 years ago, or whether the
Intel and AMD implementations differ in this respect.
Julian Seward [Sun, 17 Jan 2010 15:47:01 +0000 (15:47 +0000)]
x86/amd64 front ends: don't chase a conditional branch that leads
back to the start of the trace. It's better to leave the IR loop
unroller to handle such cases.
Julian Seward [Fri, 15 Jan 2010 10:53:21 +0000 (10:53 +0000)]
Add logic to allow front ends to speculatively continue adding guest
instructions into IRSBs (superblocks) after conditional branches.
Currently only the x86 and amd64 front ends support this. The
assumption is that backwards conditional branches are taken and
forwards conditional branches are not taken, which is generally
regarded as plausible and is particularly effective with code compiled
by gcc at -O2, -O3 or -O -freorder-blocks (-freorder-blocks is enabled
by default at -O2 and above).
It is disabled by default. It has been seen to provide notable
speedups in some cases (eg, --tool=none for perf/bz2) and reduces the
number of block-to-block transitions dramatically, by up to half, but
it usually makes programs run more slowly. It also increases the
amount of generated code by at least 15%-20%, and so is a net
liability in terms of icache misses and JIT time.
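(Purely as an illustration, and not code from the VEX sources: the
direction heuristic described above amounts to something like the
following, where the function name and argument names are invented.)

    /* Guess the outcome of a conditional branch at address 'ip' with
       destination 'target': backwards branches (typical loop
       back-edges) are assumed taken, forwards branches not taken. */
    static int guess_branch_taken ( unsigned long ip,
                                    unsigned long target )
    {
       return target <= ip;
    }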
Julian Seward [Mon, 11 Jan 2010 10:46:18 +0000 (10:46 +0000)]
For 32-bit reads of integer guest registers, generate a 64-bit Get
followed by a Iop_64to32 narrowing, rather than doing a 32-bit Get.
This makes the Put-to-Get-forwarding optimisation work seamlessly for
code which does 32-bit register operations (very common), which it
never did before. Also add a folding rule to remove the resulting
32-to-64-to-32 widen-narrow chains.
This reduces the amount of code generated overall by about 3%, but
gives a much larger speedup, of about 11%, for Memcheck running
perf/bz2.c.
Not sure why this is, perhaps due to reducing store bandwidth
requirements in the generated code, or due to avoiding
store-forwarding stalls when writing/reading the guest state.
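(A rough sketch of the two pieces described above, using constructor
names from libvex_ir.h; OFFB_RAX, its value, and the helper functions
are placeholders, not the actual toIR.c/iropt.c code.)

    #include "libvex_ir.h"

    #define OFFB_RAX 16   /* placeholder guest-state offset of RAX */

    /* The new read pattern: a 64-bit Get followed by a narrowing op,
       instead of a 32-bit Get. */
    static IRExpr* get_low32_of_rax ( void )
    {
       return IRExpr_Unop( Iop_64to32, IRExpr_Get(OFFB_RAX, Ity_I64) );
    }

    /* The matching folding rule: 64to32(32Uto64(x)) ==> x, removing
       the widen-narrow chains left behind once Put-to-Get forwarding
       has substituted the written (widened) value into the Get. */
    static IRExpr* fold_widen_narrow ( IRExpr* e )
    {
       if (e->tag == Iex_Unop
           && e->Iex.Unop.op == Iop_64to32
           && e->Iex.Unop.arg->tag == Iex_Unop
           && e->Iex.Unop.arg->Iex.Unop.op == Iop_32Uto64)
          return e->Iex.Unop.arg->Iex.Unop.arg;
       return e;
    }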
Julian Seward [Sat, 9 Jan 2010 11:43:21 +0000 (11:43 +0000)]
* support PLD (cache-preload-hint) instructions
* start of a framework for decoding instructions in NV space
* fix a couple of unused/untested RRX shifter operand cases
Julian Seward [Thu, 31 Dec 2009 19:26:03 +0000 (19:26 +0000)]
Make the x86 and amd64 back ends use the revised prototypes for
genSpill and genReload. ppc32/64 backends are still broken.
Also, tidy up associated pointer-type casting in main_main.c.
Julian Seward [Thu, 31 Dec 2009 18:00:12 +0000 (18:00 +0000)]
Merge r1925:1948 from branches/ARM. This temporarily breaks all other
targets, because a few IR primops to do with int<->float conversions
have been renamed, and because an internal interface for creating
spill/reload instructions has changed.
Julian Seward [Thu, 26 Nov 2009 17:17:37 +0000 (17:17 +0000)]
Change the IR representation of load linked and store conditional.
They are now moved out into their own new IRStmt kind (IRStmt_LLSC),
and are no longer treated merely as variants of standard loads
(IRExpr_Load) or stores (IRStmt_Store). This is necessary because load
linked is a load with a side effect (lodging a reservation), and hence
it cannot be an IRExpr, since IRExprs denote side-effect-free value
computations.
Fix up all front and back ends accordingly; also iropt.
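(A hedged sketch of the new representation, using present-day names
from libvex_ir.h; the helper below and its arguments are invented for
illustration and are not the actual ppc front-end code.)

    #include "libvex_ir.h"

    /* Append a load-linked / store-conditional pair (as for ppc
       lwarx/stwcx.) to superblock 'bb'.  A NULL storedata marks the
       load-linked case. */
    static void emit_ll_sc ( IRSB* bb, IRTemp tRes, IRTemp tOK,
                             IRExpr* addr, IRExpr* data )
    {
       /* Load-linked: tRes receives the loaded value and a
          reservation is lodged -- the side effect that forces this
          to be a statement rather than an expression. */
       addStmtToIRSB( bb, IRStmt_LLSC(Iend_BE, tRes, addr, NULL) );

       /* Store-conditional: tOK receives a success/failure flag
          saying whether the reservation still held and the store
          happened. */
       addStmtToIRSB( bb, IRStmt_LLSC(Iend_BE, tOK, addr, data) );
    }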
Use a much faster hash function to do the self-modifying-code checks.
This reduces the extra overhead of --smc-check=all when running
Memcheck from about 75% to about 45%.
Julian Seward [Sun, 2 Aug 2009 14:35:45 +0000 (14:35 +0000)]
Implement mfpvr (mfspr 287) (bug #201585).
Also, fix a type mismatch in the generated IR for mfspr 268/269 which
would have caused an IR checker assertion failure when handling those
insns on ppc64.
Tell the register allocator on x86 that xmm0..7 are trashed across
function calls. This forces it to handle them as caller-saved, which
is (to the extent that it's possible to tell) what the ELF ABI
requires. Lack of this has been observed to corrupt floating point
computations in tools that use the xmm registers in the helper
functions called from generated code. This change brings the x86
backend into line with the amd64 backend, the latter of which has
always treated the xmm regs as caller-saved.
The x87 registers are still incorrectly handled as callee-saved.
Add new integer comparison primitives Iop_CasCmp{EQ,NE}{8,16,32,64},
which are semantically identical to Iop_Cmp{EQ,NE}{8,16,32,64}. Use
these new primitives instead of the normal ones in the tests that
follow IR-level compare-and-swap operations and establish whether or
not the CAS succeeded. This is all for Memcheck's benefit,
as it really needs to be able to identify which comparisons are
CAS-success tests and which aren't. This is all described in great
detail in memcheck/mc_translate.c in the comment
"COMMENT_ON_CasCmpEQ".
Flatten out the directory structure in the priv/ side, by pulling all
files into priv/ and giving them unique names. This makes it easier
to use automake to build all this stuff in Valgrind. It also tidies
up a directory structure which had become a bit pointlessly complex.
This branch adds proper support for atomic instructions, proper in the
sense that the atomicity is preserved through the compilation
pipeline, and thus in the instrumented code.
The change adds a new IR statement kind, IRStmt_CAS, which represents
single- and doubleword compare-and-swap. This is used as the basis
for the translation of all LOCK-prefixed instructions on x86 and
amd64.
The change also extends IRExpr_Load and IRStmt_Store so that
load-linked and store-conditional operations can be represented. This
facilitates correct translation of l[wd]arx and st[wd]cx. on ppc in
the sense that these instructions will now eventually be regenerated
at the end of the compilation pipeline.
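(A minimal sketch of the new statement kind, using the constructors
from libvex_ir.h; the helper, its arguments and the choice of a 32-bit
little-endian CAS are illustrative only.)

    #include "libvex_ir.h"

    /* Append a single-word (32-bit) compare-and-swap to superblock
       'bb'.  The *Hi arguments are absent (IRTemp_INVALID / NULL)
       because this is not a doubleword CAS.  tOld receives the value
       found at the address; newVal is stored only if that value
       equals expdVal. */
    static void emit_cas32 ( IRSB* bb, IRTemp tOld, IRExpr* addr,
                             IRExpr* expdVal, IRExpr* newVal )
    {
       IRCAS* cas = mkIRCAS( IRTemp_INVALID, tOld,  /* old value out */
                             Iend_LE, addr,         /* where */
                             NULL, expdVal,         /* expected value */
                             NULL, newVal );        /* new value */
       addStmtToIRSB( bb, IRStmt_CAS(cas) );
    }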
Julian Seward [Thu, 19 Mar 2009 22:21:40 +0000 (22:21 +0000)]
To make it possible for Valgrind to restart client syscalls that have
been interrupted by signals on Darwin, generalise an idea which first
emerged in the guest ppc32/64 support, where it was used to solve the
same problem on AIX.
Idea is: make all guests have a pseudo-register "IP_AT_SYSCALL", which
records the address of the most recently executed system call
instruction. Then, to back up the guest over the most recent syscall,
simply make its program counter equal to this value. This idea
already existed for the ppc32/64 guests, but the register was called
"CIA_AT_SC".
Currently it is not set in guest-amd64.
This commit will break the Valgrind svn trunk (temporarily).
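(As a sketch only: the field names below follow the current amd64
guest state header, but since the commit notes the register is not yet
set on amd64, treat the whole function as hypothetical.)

    #include "libvex_guest_amd64.h"

    /* Back the guest up over the most recently executed syscall by
       making the program counter equal to the recorded syscall
       address. */
    static void back_up_over_syscall ( VexGuestAMD64State* gst )
    {
       gst->guest_RIP = gst->guest_IP_AT_SYSCALL;
    }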
Julian Seward [Wed, 4 Jun 2008 09:10:38 +0000 (09:10 +0000)]
Translate "fnstsw %ax" in a slightly different way, which plays better
with Memcheck's origin tracking stuff. a.k.a. a lame kludge. See
comments in source.
Julian Seward [Fri, 30 May 2008 22:58:07 +0000 (22:58 +0000)]
In some obscure circumstances, the allocator would incorrectly omit a
spill store on the basis that the register being spilled had the same
value as the spill slot being written to. This change is believed to
make the equals-spill-slot optimisation correct. Fixes a bug first
observed by Nuno Lopes and later by Marc-Oliver Straub.
Julian Seward [Sun, 11 May 2008 10:11:58 +0000 (10:11 +0000)]
Compute the starting address of the instruction correctly. This has
always been wrong and can cause the next-instruction-address to be
wrong in obscure circumstances. Fixes #152818.
Julian Seward [Mon, 31 Mar 2008 01:51:57 +0000 (01:51 +0000)]
Specialise CondNS after SUBB. The lack of this was causing Memcheck to
report false positives in some tricky bitfield code in OOo 2.4 (writer)
when loading MS Word docs.
Julian Seward [Wed, 6 Feb 2008 11:42:45 +0000 (11:42 +0000)]
Add SSSE3 support. Currently only for 64-bit. TODO:
* Check through IR generation
* For 128-bit variants accessing memory, generate an exception
if effective address is not 128-bit aligned
* Change CPUID output to be Core-2 like
* Enable for 32-bit code too.
* Make Memcheck handle the new IROps
* Commit test cases
Julian Seward [Mon, 26 Nov 2007 23:18:52 +0000 (23:18 +0000)]
Fix stupid bug in x86 isel: when generating code for a 64-bit integer
store, don't generate code to compute the address expression twice.
Spotted by Nick N whilst peering at code generated for new Massif.
Preventative changes in amd64 back end (which doesn't appear to have
the same problem).
Julian Seward [Mon, 19 Nov 2007 00:39:23 +0000 (00:39 +0000)]
Fix this:
vex: priv/guest-amd64/toIR.c:3741 (dis_Grp5): Assertion `sz == 4' failed.
(CALL Ev with sz==8) as reported in #150678 and #146252. Also change a
bunch of assertions on undecoded instructions into proper decoding failures.
Julian Seward [Thu, 15 Nov 2007 23:30:16 +0000 (23:30 +0000)]
Handle the "alternative" (non-binutils) encoding of 'adc' and tidy up
some other op-G-E / op-E-G decodings. This fixes a bug which was
reported on valgrind-users@lists.sourceforge.net on 11 Aug 2007
("LibVEX called failure_exit() with 3.3.0svn-r6769 with Linux on
AMD64") I don't think it ever was formally filed as a bug report.
Julian Seward [Fri, 9 Nov 2007 21:15:04 +0000 (21:15 +0000)]
Merge changes from THRCHECK branch r1787. These changes are all to do
with making x86/amd64 LOCK prefixes properly visible in the IR, since
threading tools need to see them. It would probably be no bad thing for
cachegrind/callgrind to notice them too, since asserting a bus lock on
a multiprocessor is an expensive event that programmers might like to
know about.
* amd64 front end: handle LOCK prefixes a lot more accurately
* x86 front end: ditto, and also a significant cleanup of prefix
handling, which was a mess
* To represent LOCK prefixes, remove the IR 'Ist_MFence' statement
kind and replace it with something more general: an IR Memory Bus
Event statement (Ist_MBE), which can represent lock
acquisition, lock release, and memory fences.
* Fix up all front ends and back ends to respectively generate
and handle Ist_MBE. Fix up the middle end (iropt) to deal with
them.
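(For illustration, using present-day names from libvex_ir.h: a front
end now emits a memory fence as below; the old form was
IRStmt_MFence(), and the lock-acquire/release event names are not
shown here.)

    #include "libvex_ir.h"

    /* Emit a memory fence via the new Memory Bus Event statement. */
    static void emit_fence ( IRSB* bb )
    {
       addStmtToIRSB( bb, IRStmt_MBE(Imbe_Fence) );
    }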