Support the DCBZL instruction. Also, query the host CPU at startup
time to find out how much space DCBZL really clears, and make the
guest CPU act accordingly. (VEX-side changes)
(Dave Goodell, goodell@mcs.anl.gov)
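To illustrate the startup-time query: a minimal sketch of how such a
probe might work (hypothetical helper; assumes a ppc host where DCBZL
is available and clears at most 512 bytes):

    #include <string.h>

    /* Fill an aligned scratch buffer with 0xFF, execute one dcbzl,
       and count how many bytes it zeroed.  Sketch only. */
    static int probe_dcbzl_size ( void )
    {
       static unsigned char buf[1024] __attribute__((aligned(512)));
       int i, n = 0;
       memset(buf, 0xFF, sizeof buf);
       __asm__ __volatile__("dcbzl 0,%0" : : "r"(buf) : "memory");
       for (i = 0; i < 512; i++)
          if (buf[i] == 0) n++;
       return n;  /* eg 128 on a PPC970, 32 on most other ppcs */
    }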
Julian Seward [Sun, 22 Aug 2010 12:59:02 +0000 (12:59 +0000)]
Merge from branches/THUMB: new IR primops and associated
infrastructure, needed to represent NEON instructions. Way more new
ones than I would like, but I can't see a way to avoid having them.
Julian Seward [Sun, 22 Aug 2010 12:54:56 +0000 (12:54 +0000)]
Merge from branches/THUMB: hwcaps for ARM. This may get simplified,
since ARMv5 and ARMv6 are not in fact supported targets -- ARMv7
remains the minimum supported target.
Julian Seward [Sun, 22 Aug 2010 12:44:20 +0000 (12:44 +0000)]
Merge from branches/THUMB: front end changes to support:
* Thumb integer instructions
* NEON in both ARM and Thumb mode
* VFP in both ARM and Thumb mode
* infrastructure to support APSR.Q flag representation
Julian Seward [Sun, 22 Aug 2010 12:38:53 +0000 (12:38 +0000)]
Merge from branches/THUMB: a spechelper interface change that allows
the helper to look back at the previous IR statements. This may be
backed out if it turns out to be no longer needed for optimising
Thumb translations.
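For reference, the changed interface is roughly as follows (sketch;
the two extra parameters are the new part, giving the helper a view
of the statements already appended to the block):

    /* Sketch of the revised interface.  Returns a simplified
       replacement expression, or NULL if no simplification found. */
    extern
    IRExpr* guest_x86_spechelper ( HChar*   function_name,
                                   IRExpr** args,
                                   /* New: lets the helper inspect
                                      earlier IR statements. */
                                   IRStmt** precedingStmts,
                                   Int      n_precedingStmts );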
Julian Seward [Tue, 17 Aug 2010 22:52:08 +0000 (22:52 +0000)]
Add a moderately comprehensive implementation of the SSE4.2 string
instructions PCMP{I,E}STR{I,M}. They are an absolute nightmare of
complexity. Most of the 8-bit data processing variants are supported,
but none of the 16-bit variants.
Also add support for PINSRB and PTEST.
With these changes, I believe Valgrind supports all the SSE4.2
instructions used in glibc-2.11 on x86_64-linux, as well as anything
that gcc can emit. So that gives fairly good coverage.
Currently these instructions are handled, but CPUID still claims to
be an older, non-SSE4-capable Core 2, so software that correctly
checks CPU features will not use them. Following further testing I
will enable the relevant SSE4.2 bits in CPUID.
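To give a flavour of what the front end now has to model: a small
illustrative use of PCMPISTRI via intrinsics, of the kind glibc's
string routines rely on (function name hypothetical; compile with
-msse4.2):

    #include <nmmintrin.h>

    /* Return the index (0..15) of the first byte of the 16 bytes
       at s that occurs in the 16-byte set, or 16 if none does.
       Illustrative only. */
    static int find_in_set_16 ( const char* s, const char* set16 )
    {
       __m128i set   = _mm_loadu_si128((const __m128i*)set16);
       __m128i chunk = _mm_loadu_si128((const __m128i*)s);
       /* PCMPISTRI: unsigned bytes, equal-any comparison, index of
          the least significant match returned in ECX. */
       return _mm_cmpistri(set, chunk,
                           _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY
                           | _SIDD_LEAST_SIGNIFICANT);
    }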
Julian Seward [Fri, 6 Aug 2010 07:59:38 +0000 (07:59 +0000)]
Add partial support for the SSE 4.2 PCMPISTRI instruction, at least
for (some of) the sub-cases that glibc uses (64-bit mode only). Also,
prepare for transitioning CPUID in 64-bit mode to indicate SSE4.2
support (not yet enabled).
Be warned, this commit will require a from-clean rebuild of Valgrind.
Don't trash the ELF ABI redzone for amd64 when emulating BT{,S,R,C}
reg,reg. Fixes (well, at least, makes an appalling kludge a bit less
appalling) #245925.
Handle the mov[ua]pd G(xmm) -> E(xmm) case, which is something binutils
doesn't produce, presumably because it uses the E->G encoding for xmm
reg-reg moves. Fixes #238713. (Pierre Willenbrock,
pierre@pirsoft.de).
Support the SSE4 insn 'roundss' in 32-bit mode. Lack of this was
causing problems for people running 32-bit apps on MacOSX 10.6 on
newer hardware. Fixes #241377.
Julian Seward [Fri, 18 Jun 2010 08:17:41 +0000 (08:17 +0000)]
Implement SSE4 instructions: PCMPGTQ PMAXUD PMINUD PMAXSB PMINSB PMULLD
I believe this covers everything that gcc-4.4 and gcc-4.5 will generate
with "-O3 -msse4.2". Note, this commit changes the set of IR ops and so
requires a from-scratch rebuild of the tree.
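For example, a simple loop like the following is the kind of thing
gcc-4.4/4.5 vectorises using PMULLD at -O3 -msse4.2 (illustrative;
exact code generation depends on the compiler):

    /* Element-wise 32-bit multiply; with -O3 -msse4.2 gcc typically
       vectorises the multiply using PMULLD.  Illustrative only. */
    void mul32 ( int* restrict dst, const int* restrict a,
                 const int* restrict b, int n )
    {
       int i;
       for (i = 0; i < n; i++)
          dst[i] = a[i] * b[i];
    }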
Julian Seward [Mon, 7 Jun 2010 16:22:22 +0000 (16:22 +0000)]
Implement SIDT and SGDT as pass-throughs to the host. It's a pretty
bad thing to do, but I can't think of a way to virtualise these
properly. Patch from Alexander Potapenko. See
https://bugs.kde.org/show_bug.cgi?id=205241#c38
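The pass-through amounts to something like this (sketch, 64-bit host;
struct and function names hypothetical):

    /* SGDT writes a pseudo-descriptor: a 16-bit limit followed by
       the table base.  Executing it on the host and handing the
       result to the guest is the 'pass-through'.  Sketch only. */
    struct __attribute__((packed)) dt_reg {
       unsigned short     limit;
       unsigned long long base;    /* 64-bit base on a 64-bit host */
    };

    static void host_sgdt ( struct dt_reg* out )
    {
       __asm__ __volatile__("sgdt %0" : "=m"(*out));
    }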
Julian Seward [Tue, 4 May 2010 08:48:43 +0000 (08:48 +0000)]
Handle the v7 memory fence instructions ISB, DSB and DMB, and their
v6 equivalents: mcr 15,0,r0,c7,c5,4; mcr 15,0,r0,c7,c10,4; and
mcr 15,0,r0,c7,c10,5, respectively. Re-emit them in the v6 form so as
not to inhibit possible
support for v6-only platforms in the future. Extended version of a patch
from Alexander Potapenko (glider@google.com). Fixes bug 228060.
Julian Seward [Sun, 21 Feb 2010 20:40:53 +0000 (20:40 +0000)]
CVTPI2PD (which converts 2 x I32 in M64 or MMX to 2 x F64 in XMM):
only switch the x87 FPU to MMX mode in the case where the source
operand is in memory, not in an MMX register. This fixes #210264.
This is all very fishy.
* it's inconsistent with all other instructions which convert between
values in (MMX or M64) and XMM, in that those put the FPU in MMX mode
even when the source is memory rather than MMX (for example,
CVTPI2PS). At least, that's what the Intel docs appear to say.
* the AMD documentation makes no mention at all of this. For example
it makes no differentiation in this matter between CVTPI2PD and
CVTPI2PS.
I wonder if Intel surreptitiously changed the behaviour of CVTPI2PD
since this code was written circa 5 years ago. Or, whether the Intel
and AMD implementations differ in this respect.
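For reference, the user-visible subtlety, expressed with intrinsics
(behaviour as the Intel docs appear to describe it; sketch only):

    #include <mmintrin.h>
    #include <emmintrin.h>

    /* _mm_cvtpi32_pd maps onto CVTPI2PD with an MMX-register
       source, which puts the x87 unit into MMX mode -- hence the
       _mm_empty() (EMMS) before any later x87 use.  Sketch only. */
    static void convert2 ( const int* in, double* out )
    {
       __m64   src = _mm_set_pi32(in[1], in[0]); /* 2 x I32, MMX */
       __m128d dst = _mm_cvtpi32_pd(src);        /* 2 x F64, XMM */
       _mm_storeu_pd(out, dst);
       _mm_empty();                              /* leave MMX mode */
    }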
Julian Seward [Sun, 17 Jan 2010 15:47:01 +0000 (15:47 +0000)]
x86/amd64 front ends: don't chase a conditional branch that leads
back to the start of the trace. It's better to leave the IR loop
unroller to handle such cases.
Julian Seward [Fri, 15 Jan 2010 10:53:21 +0000 (10:53 +0000)]
Add logic to allow front ends to speculatively continue adding guest
instructions into IRSBs (superblocks) after conditional branches.
Currently only the x86 and amd64 front ends support this. The
assumption is that backwards conditional branches are taken and
forwards conditional branches are not taken, which is generally
regarded as plausible and is particularly effective with code compiled
by gcc at -O2, -O3 or -O -freorder-blocks (-freorder-blocks is enabled
by default at -O2 and above).
It is disabled by default. It has been seen to provide notable
speedups (eg, --tool=none for perf/bz2) and reduces the number of
block-to-block transitions dramatically, by up to half; but it
usually makes programs run more slowly, since it increases the amount
of generated code by at least 15%-20% and so is a net liability in
terms of icache misses and JIT time.
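The guess itself is the classic static-prediction rule; in sketch
form (hypothetical helper, not the actual VEX code):

    /* Backwards conditional branches (loops) are guessed taken;
       forwards conditional branches are guessed not taken.
       Hypothetical helper, not the actual VEX code. */
    static int guess_cond_branch_taken ( unsigned long long insn_addr,
                                         unsigned long long target )
    {
       return target <= insn_addr;  /* backwards => probably a loop */
    }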
Julian Seward [Mon, 11 Jan 2010 10:46:18 +0000 (10:46 +0000)]
For 32-bit reads of integer guest registers, generate a 64-bit Get
followed by an Iop_64to32 narrowing, rather than doing a 32-bit Get.
This makes the Put-to-Get-forwarding optimisation work seamlessly for
code which does 32-bit register operations (very common), which it
never did before. Also add a folding rule to remove the resulting
32-to-64-to-32 widen-narrow chains.
This reduces the amount of code generated overall by about 3%, but gives
a much larger speedup, of about 11% for Memcheck running perf/bz2.c.
Not sure why this is, perhaps due to reducing store bandwidth
requirements in the generated code, or due to avoiding
store-forwarding stalls when writing/reading the guest state.
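In IR-building terms the change is roughly this (sketch using VEX's
IR constructors; helper names and offset handling illustrative):

    /* Before: a direct 32-bit read of the guest state.  Sketch. */
    static IRExpr* getIReg32_old ( Int offs )
    {
       return IRExpr_Get(offs, Ity_I32);
    }

    /* After: read all 64 bits and narrow.  A preceding Put of the
       full 64-bit register can now be forwarded straight to the
       Get, and iropt folds away 32->64->32 chains. */
    static IRExpr* getIReg32_new ( Int offs )
    {
       return IRExpr_Unop(Iop_64to32, IRExpr_Get(offs, Ity_I64));
    }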
Julian Seward [Sat, 9 Jan 2010 11:43:21 +0000 (11:43 +0000)]
* support PLD (cache-preload-hint) instructions
* start of a framework for decoding instructions in NV space
* fix a couple of unused/untested RRX shifter operand cases
Julian Seward [Thu, 31 Dec 2009 19:26:03 +0000 (19:26 +0000)]
Make the x86 and amd64 back ends use the revised prototypes for
genSpill and genReload. ppc32/64 backends are still broken.
Also, tidy up associated pointer-type casting in main_main.c.
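The revised shape is roughly as follows (x86 flavour shown; sketch,
details may differ per backend):

    /* Sketch: spill/reload generators may now return up to two
       instructions via the OUT parameters -- needed by targets
       whose spill offsets can exceed an immediate field's range. */
    extern void genSpill_X86  ( /*OUT*/ HInstr** i1,
                                /*OUT*/ HInstr** i2,
                                HReg rreg, Int offsetB, Bool mode64 );
    extern void genReload_X86 ( /*OUT*/ HInstr** i1,
                                /*OUT*/ HInstr** i2,
                                HReg rreg, Int offsetB, Bool mode64 );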
Julian Seward [Thu, 31 Dec 2009 18:00:12 +0000 (18:00 +0000)]
Merge r1925:1948 from branches/ARM. This temporarily breaks all other
targets, because a few IR primops to do with int<->float conversions
have been renamed, and because an internal interface for creating
spill/reload instructions has changed.
Julian Seward [Thu, 26 Nov 2009 17:17:37 +0000 (17:17 +0000)]
Change the IR representation of load linked and store conditional.
They are now moved out into their own new IRStmt kind (IRStmt_LLSC),
and are not treated merely as variants of standard loads (IRExpr_Load)
or stores (IRStmt_Store). This is necessary because load linked is a
load with a side effect (lodging a reservation), hence it cannot be an
IRExpr since IRExprs denote side-effect free value computations.
Fix up all front and back ends accordingly; also iropt.
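The new statement kind has roughly this shape (a NULL storedata
distinguishes the load-linked case; sketch):

    /* Load-Linked / Store-Conditional, as one statement kind.
       storedata == NULL  =>  Load-Linked:       result = *addr,
                              and a reservation is lodged.
       storedata != NULL  =>  Store-Conditional: *addr = storedata
                              if the reservation still holds;
                              result gets the success flag. */
    extern IRStmt* IRStmt_LLSC ( IREndness end,
                                 IRTemp    result,
                                 IRExpr*   addr,
                                 IRExpr*   storedata );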
Use a much faster hash function to do the self-modifying-code checks.
This reduces the extra overhead of --smc-check=all when running
Memcheck from about 75% to about 45%.
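The general idea, in sketch form (illustrative only, not the hash
actually used): checksum the original guest code words cheaply, so
that a translation can be compared against what is currently in
memory.

    /* Word-at-a-time rotate-and-xor checksum over the original
       guest bytes of a translation.  Illustrative only. */
    static unsigned long long
    fast_code_hash ( const unsigned long long* p, int n_words )
    {
       unsigned long long sum = 0;
       int i;
       for (i = 0; i < n_words; i++)
          sum = ((sum << 1) | (sum >> 63)) ^ p[i];
       return sum;
    }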
Julian Seward [Sun, 2 Aug 2009 14:35:45 +0000 (14:35 +0000)]
Implement mfpvr (mfspr 287) (bug #201585).
Also, fix a type mismatch in the generated IR for mfspr 268/269 which
would have caused an IR checker assertion failure when handling those
insns on ppc64.
Tell the register allocator on x86 that xmm0..7 are trashed across
function calls. This forces it to handle them as caller-saved, which
is (to the extent that it's possible to tell) what the ELF ABI
requires. Lack of this has been observed to corrupt floating point
computations in tools that use the xmm registers in the helper
functions called from generated code. This change brings the x86
backend into line with the amd64 backend, the latter of which has
always treated the xmm regs as caller-saved.
The x87 registers are still incorrectly handled as callee-saved.
Add new integer comparison primitives Iop_CasCmp{EQ,NE}{8,16,32,64},
which are semantically identical to Iop_Cmp{EQ,NE}{8,16,32,64}. Use
these new primitives instead of the normal ones, in the tests
following IR-level compare-and-swap operations, which establish
whether or not the CAS succeeded. This is all for Memcheck's benefit,
as it really needs to be able to identify which comparisons are
CAS-success tests and which aren't. This is all described in great
detail in memcheck/mc_translate.c in the comment
"COMMENT_ON_CasCmpEQ".
Flatten out the directory structure in the priv/ side, by pulling all
files into priv/ and giving them unique names. This makes it easier
to use automake to build all this stuff in Valgrind. It also tidies
up a directory structure which had become a bit pointlessly complex.