Julian Seward [Mon, 26 Nov 2007 23:18:52 +0000 (23:18 +0000)]
Fix stupid bug in x86 isel: when generating code for a 64-bit integer
store, don't generate code to compute the address expression twice.
Spotted by Nick N whilst peering at code generated for new Massif.
Preventative changes in amd64 back end (which doesn't appear to have
the same problem).
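On x86 the 64-bit store presumably comes out as two 32-bit stores, so the point is, roughly (a sketch, in the same spirit as the examples below):
   <compute addr> -> %t        ; evaluate the address expression just once
   movl %lo32, 0(%t)           ; low 32 bits
   movl %hi32, 4(%t)           ; high 32 bits, reusing %t rather than
                               ; re-evaluating the address expression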
Julian Seward [Mon, 19 Nov 2007 00:39:23 +0000 (00:39 +0000)]
Fix this:
vex: priv/guest-amd64/toIR.c:3741 (dis_Grp5): Assertion `sz == 4' failed.
(CALL Ev with sz==8) as reported in #150678 and #146252. Also change a
bunch of assertions on undecoded instructions into proper decoding failures.
Julian Seward [Thu, 15 Nov 2007 23:30:16 +0000 (23:30 +0000)]
Handle the "alternative" (non-binutils) encoding of 'adc' and tidy up
some other op-G-E / op-E-G decodings. This fixes a bug which was
reported on valgrind-users@lists.sourceforge.net on 11 Aug 2007
("LibVEX called failure_exit() with 3.3.0svn-r6769 with Linux on
AMD64") I don't think it ever was formally filed as a bug report.
Julian Seward [Fri, 9 Nov 2007 21:15:04 +0000 (21:15 +0000)]
Merge changes from THRCHECK branch r1787. These changes are all to do
with making x86/amd64 LOCK prefixes properly visible in the IR, since
threading tools need to see them. Probably would be no bad thing for
cachegrind/callgrind to notice them too, since asserting a bus lock on
a multiprocessor is an expensive event that programmers might like to
know about.
* amd64 front end: handle LOCK prefixes a lot more accurately
* x86 front end: ditto, and also a significant cleanup of prefix
handling, which was a mess
* To represent prefixes, remove the IR 'Ist_MFence' construction
and replace it with something more general: an IR Memory Bus
Event statement (Ist_MBE), which can represent lock
acquisition, lock release, and memory fences (sketched below).
* Fix up all front ends and back ends to respectively generate
and handle Ist_MBE. Fix up the middle end (iropt) to deal with
them.
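Roughly, the new statement has this shape (a sketch; the exact names may differ slightly from the real IR definitions):
   /* sketch: the memory bus event statement that replaces Ist_MFence */
   typedef enum { Imbe_Fence, Imbe_BusLock, Imbe_BusUnlock } IRMBusEvent;
   IRStmt* IRStmt_MBE ( IRMBusEvent event );  /* e.g. IRStmt_MBE(Imbe_Fence) */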
Julian Seward [Tue, 28 Aug 2007 06:06:27 +0000 (06:06 +0000)]
Merge, from CGTUNE branch, r1774:
Vex-side changes to allow tools to provide a final_tidy function which
they can use to mess with the final post-tree-built IR before it is
handed off to instruction selection.
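The hook is roughly of this shape (a sketch; the exact name and signature may differ):
   /* tool-supplied; applied to the final IRSB just before instruction
      selection, returning the (possibly modified) superblock */
   IRSB* (*final_tidy) ( IRSB* );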
Julian Seward [Sat, 25 Aug 2007 23:21:08 +0000 (23:21 +0000)]
Merge from CGTUNE branch, code generation improvements for amd64:
r1772:
When generating code for helper calls, be more aggressive about
computing values directly into argument registers, thereby avoiding
some reg-reg shuffling. This reduces the amount of code (on amd64)
generated by Cachegrind by about 6% and has zero or marginal benefit
for other tools.
r1773:
Emit 64-bit branch targets using 32-bit short forms when possible.
Since (with V's default amd64 load address of 0x38000000) this is
usually possible, it saves about 7% in code size for Memcheck and even
more for Cachegrind.
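For example (register and target address here are purely illustrative):
   movabsq $0x38001234, %r11   ; general form, full 64-bit immediate
   movl    $0x38001234, %r11d  ; short form; zero-extends to 64 bits, so it is
                               ; usable whenever the target fits in 32 bits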
Julian Seward [Sat, 25 Aug 2007 23:07:44 +0000 (23:07 +0000)]
Merge from CGTUNE branch:
r1769:
This commit provides a bunch of enhancements to the IR optimiser
(iropt) and to the various backend instruction selectors.
Unfortunately the changes are interrelated and cannot easily be
committed in pieces in any meaningful way. Between them and the
already-committed register allocation enhancements (r1765, r1767)
performance of Memcheck is improved by 0%-10%. Improvements are also
applicable to other tools to lesser extents.
Main changes are:
* Add new IR primops Iop_Left64/32/16/8 and Iop_CmpwNEZ64/32/16/8
which Memcheck uses to express some primitive operations on
definedness (V) bits:
Left(x) = set all bits to the left of the rightmost 1 bit to 1
CmpwNEZ(x) = if x == 0 then 0 else 0xFF...FF
Left and CmpwNEZ are detailed in the Usenix 2005 paper (in which
CmpwNEZ is called PCast); a short C sketch of both follows this
list. The new primops expose opportunities for
IR optimisation at tree-build time. Prior to this change Memcheck
expressed Left and CmpwNEZ in terms of lower level primitives
(logical or, negation, compares, various casts) which was simpler
but hindered further optimisation.
* Enhance the IR optimiser's tree builder so it can rewrite trees
as they are constructed, according to useful identities, for example:
CmpwNEZ64( Or64 ( CmpwNEZ64(x), y ) ) --> CmpwNEZ64( Or64( x, y ) )
which gets rid of a CmpwNEZ64 operation - a win as they are relatively
expensive. See functions fold_IRExpr_Binop and fold_IRExpr_Unop.
Allowing the tree builder to rewrite trees also makes it possible to
have a single implementation of certain transformation rules which
were previously duplicated in the x86, amd64 and ppc instruction
selectors. For example
32to1(1Uto32(x)) --> x
This simplifies the instruction selectors and gives a central place
to put such IR-level transformations, which is a Good Thing.
* Various minor refinements to the instruction selectors:
- ppc64 generates 32Sto64 into 1 instruction instead of 2
- x86 can now generate movsbl
- x86 handles 64-bit integer Mux0X better for cases typically
arising from Memchecking of FP code
- misc other patterns handled better
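As promised above, a rough C sketch of the 64-bit versions of Left and
CmpwNEZ (value semantics only, not the IR-level implementation):
   /* Left64: set every bit at, and to the left of, the rightmost 1 bit */
   unsigned long long left64    ( unsigned long long x ) { return x | (~x + 1ULL); }  /* x | -x */
   /* CmpwNEZ64: 0 if x == 0, else all ones ("PCast" in the paper) */
   unsigned long long cmpwNEZ64 ( unsigned long long x ) { return x == 0 ? 0ULL : ~0ULL; }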
Overall these changes are a straight win - vex generates less code,
and does so a bit faster since its register allocator has to chew
through fewer instructions. The main risk is that of correctness:
making Left and CmpwNEZ explicit, and adding rewrite rules for them,
is a substantial change in the way Memcheck deals with undefined value
tracking, and I am concerned to ensure that the changes do not cause
false negatives. I _think_ it's all correct so far.
r1770:
Get rid of Iop_Neg64/32/16/8 as they are no longer used by Memcheck,
and any uses generated by the front ends are so infrequent that
generating the equivalent Sub(0, ..) is good enough. This gets rid of
quite a few lines of code. Add isel cases for Sub(0, ..) patterns so
that the x86/amd64 backends still generate negl/negq where possible.
r1771:
Handle Left64. Fixes failure on none/tests/x86/insn_sse2.
Julian Seward [Sat, 25 Aug 2007 21:29:03 +0000 (21:29 +0000)]
Merge, from CGTUNE branch:
r1768:
Cosmetic (non-functional) changes associated with r1767.
r1767:
Add a second spill-code-avoidance optimisation, which could be called
'directReload' for lack of a better name.
If an instruction reads exactly one vreg which is currently in a spill
slot, and this is the last use of that vreg, see if the instruction can be
converted into one that reads directly from the spill slot. This is
clearly only possible for x86 and amd64 targets, since ppc is a
load-store architecture. So, for example,
orl %vreg, %dst
where %vreg is in a spill slot, and this is its last use, would
previously be converted to
movl $spill-offset(%ebp), %tmp
orl %tmp, %dst
whereas now it becomes
orl $spill-offset(%ebp), %dst
This not only avoids an instruction, it eliminates the need for a
reload temporary (%tmp in this example) and so potentially further
reduces spilling.
Implementation is in two parts: an architecture independent part, in
reg_alloc2.c, which finds candidate instructions, and a host dependent
function (directReload_ARCH) for each arch supporting the
optimisation. The directReload_ function does the instruction form
conversion, when possible. Currently only x86 hosts are supported.
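In outline, the reg_alloc2.c side does something like this (all names here
are illustrative, not the real ones):
   /* sketch: just before generating a reload for instruction 'instr' */
   HReg only_vreg;
   if (reads_exactly_one_spilled_vreg(instr, &only_vreg)
       && is_last_use_of(only_vreg, instr)) {
      HInstr* folded = directReload_X86(instr, only_vreg,
                                        spill_offset_of(only_vreg));
      if (folded != NULL)
         instr = folded;   /* reads straight from the spill slot; no reload */
   }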
As a side effect, change the form of the X86_Test32 instruction from
reg-only to reg/mem so it can participate in such transformations.
This gives a code size reduction of 0.6% for perf/bz2 on x86 memcheck,
but tends to be more effective for long blocks of x86 FP code.
Julian Seward [Sat, 25 Aug 2007 21:11:33 +0000 (21:11 +0000)]
Merge, from CGTUNE branch:
r1765:
During register allocation, keep track of which (real) registers have
the same value as their associated spill slot. Then, if a register
needs to be freed up for some reason, and that register has the same
value as its spill slot, there is no need to produce a spill store.
This substantially reduces the number of spill store instructions
created. Overall gives a 1.9% generated code size reduction for
perf/bz2 running on x86.
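A sketch of the idea (field names approximate, not the real declarations):
   /* per real register, during allocation */
   typedef struct {
      HReg vreg;           /* virtual register currently held, if any */
      Bool eq_spill_slot;  /* rreg holds the same value as vreg's spill slot?
                              set after a spill or a clean reload, cleared
                              whenever the rreg is written */
   } RRegState;

   /* when an rreg currently holding a vreg has to be freed up: */
   if (!rreg_state[r].eq_spill_slot)
      spill_store(rreg_state[r].vreg);   /* otherwise the slot is already current */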
r1766:
Followup to r1765: fix some comments, and rearrange fields in struct
RRegState so as to fit it into 16 bytes.
Julian Seward [Tue, 1 May 2007 13:53:01 +0000 (13:53 +0000)]
Stop gcc-4.2 producing hundreds of complaints of the form "warning:
cast from pointer to integer of different size" when compiling on a
64-bit target. gcc-4.2 is correct to complain. An interesting
question is why no previous gcc warned about this.
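The usual fix is along these lines (a sketch; not necessarily the exact
change made here):
   UInt lo  = (UInt)p;          /* 64-bit host: "cast from pointer to integer
                                   of different size" */
   UInt lo2 = (UInt)(HWord)p;   /* go through a pointer-sized integer first */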
Julian Seward [Sat, 31 Mar 2007 14:30:12 +0000 (14:30 +0000)]
Teach the x86 back end how to generate 'lea' instructions, and generate
them in a couple of places which are important. This reduces the
amount of generated code for memcheck and none by about 1%, and (in
very unscientific tests on perf/bz2) speeds memcheck up by about 1%.
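For example (illustrative):
   leal 4(%eax,%ebx,1), %ecx   ; ecx = eax + ebx + 4 in one instruction,
                               ; instead of movl %eax,%ecx ; addl %ebx,%ecx ;
                               ;            addl $4,%ecx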
Julian Seward [Sun, 25 Mar 2007 04:14:58 +0000 (04:14 +0000)]
x86 back end: use 80-bit loads/stores for floating point spills rather
than 64-bit ones, to reduce accuracy loss. To support this, in
reg-alloc, allocate 2 64-bit spill slots for each HRcFlt64 vreg
instead of just 1.
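In other words (sketch):
   fstpt <spill-slot>    ; 80-bit x87 spill store, instead of a 64-bit one
   fldt  <spill-slot>    ; 80-bit reload
                         ; 80 bits occupy 10 bytes, hence the two 64-bit slots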
Julian Seward [Tue, 20 Mar 2007 14:18:45 +0000 (14:18 +0000)]
x86 front end: synthesise SIGILL in the normal way for some obscure
invalid instruction cases, rather than asserting, as happened in
#143079 and #142279. amd64 equivalents to follow.
Julian Seward [Fri, 9 Mar 2007 18:07:00 +0000 (18:07 +0000)]
When generating 64-bit code, ensure that any addresses used in 4- or
8-byte loads or stores of the form reg+imm have the lowest 2 bits of
imm set to zero, so that they can safely be used in ld/ldu/lwa/std/stdu
instructions. This boils down to doing an extra check in
iselWordExpr_AMode and avoiding the reg+imm form in cases where the
amode might end up in any of the above-mentioned instructions.
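The check is essentially (sketch; names illustrative):
   /* ld/ldu/lwa/std/stdu are DS-form and require imm % 4 == 0,
      so only accept the reg+imm form when that holds; otherwise
      generate a reg+reg amode instead */
   Bool imm_ok_for_DSform = ((imm & 3) == 0);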
Julian Seward [Sat, 27 Jan 2007 00:46:28 +0000 (00:46 +0000)]
Fill in missing cases in eqIRConst. This stops iropt's CSE pass from
asserting in the presence of V128 immediates, which is a regression
in valgrind 3.2.2.
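Presumably the missing case is along these lines (a sketch; the V128
constant is a 16-bit per-byte mask, so comparing the masks suffices):
   case Ico_V128: return toBool(c1->Ico.V128 == c2->Ico.V128);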
Julian Seward [Wed, 10 Jan 2007 04:59:33 +0000 (04:59 +0000)]
Implement FXSAVE on amd64. Mysteriously my Athlon64 does not seem to
write all the fields that the AMD documentation says it should: it
skips FOP, RIP and RDP, so vex's implementation writes zeroes there.
Julian Seward [Sun, 24 Dec 2006 02:20:24 +0000 (02:20 +0000)]
A large but non-functional commit: as suggested by Nick, rename some
IR types, structure fields and functions to make IR a bit easier to
understand. Specifically:
* dopyIR* -> deepCopyIR*
* sopyIR* -> shallowCopyIR*
* The presence of a .Tmp union in both IRExpr and IRStmt is
confusing. It has been renamed to RdTmp in IRExpr, reflecting
the fact that here we are getting the value of an IRTemp, and to
WrTmp in IRStmt, reflecting the fact that here we are assigning
to an IRTemp.
* IRBB (IR Basic Block) is renamed to IRSB (IR SuperBlock),
reflecting the reality that Vex does not really operate in terms
of basic blocks, but in terms of superblocks - single entry,
multiple exit sequences.
* IRArray is renamed to IRRegArray, to make it clearer it refers
to arrays of guest registers and not arrays in memory.
* VexMiscInfo is renamed to VexAbiInfo, since that's what it is
-- relevant facts about the ABI (calling conventions, etc) for
both the guest and host platforms.
Julian Seward [Fri, 1 Dec 2006 02:59:17 +0000 (02:59 +0000)]
Change a stupid algorithm that deals with real register live
ranges into a less stupid one. Prior to this change, the complexity
of reg-alloc included an expensive term
O(#instrs in code sequence x #real-register live ranges in code sequence)
This commit changes that term to essentially
O(#instrs in code sequence) + O(time to sort real-reg-L-R array)
On amd64 this nearly halves the cost of register allocation and means
Valgrind performs better in translation-intensive situations (a.k.a.
starting programs). Eg, firefox start/exit falls from 119 to 113
seconds. The effect will be larger on ppc32/64 as there are more real
registers and hence real-reg live ranges to consider, and will be
smaller on x86 for the same reason.
The actual code the JIT produces should be unchanged. This commit
merely modifies how the register allocator handles one of its
important data structures.
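In outline, the new scheme is to sort the real-reg live ranges once and
then sweep them with a cursor (all names here illustrative):
   sort_by_start_point(rreg_lrs, n_rreg_lrs);        /* the O(sort) term */
   Int cursor = 0;
   for (Int ii = 0; ii < n_instrs; ii++) {
      /* only the ranges starting at ii need looking at,
         rather than rescanning the whole array */
      while (cursor < n_rreg_lrs && rreg_lrs[cursor].start == ii) {
         mark_rreg_bound(rreg_lrs[cursor].rreg);
         cursor++;
      }
      /* ... allocate vregs for instruction ii ... */
   }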
Julian Seward [Thu, 19 Oct 2006 03:01:09 +0000 (03:01 +0000)]
When doing rlwinm in 64-bit mode, bind the intermediate 32-bit result
to a temporary so it is only computed once. What's there currently
causes it to be computed twice.
Julian Seward [Tue, 17 Oct 2006 00:28:22 +0000 (00:28 +0000)]
Merge r1663-r1666:
- AIX5 build changes
- genoffsets.c: print the offsets of a few more ppc registers
- Get rid of a bunch of ad-hoc hacks which hardwire in certain
assumptions about guest and host ABIs. Instead pass that info
in a VexMiscInfo structure. This cleans up various grotty bits.
- Add to ppc32 guest state, redirection-stack stuff already present
in ppc64 guest state. This is to enable function redirection/
wrapping in the presence of TOC pointers in 32-bit mode.
- Add to both ppc32 and ppc64 guest states, a new pseudo-register
LR_AT_SC. This holds the link register value at the most recent
'sc', so that AIX can back up to restart a syscall if needed.
- Add to both ppc32 and ppc64 guest states, a SPRG3 register.
- Use VexMiscInfo to handle 'sc' on AIX differently from Linux:
on AIX, 'sc' continues at the location stated in the link
register, not at the next insn.
Add support for amd64 'fprem' (fixes bug 132918). This isn't exactly
right; the C3/2/1/0 FPU flags sometimes don't get set the same as
natively, and I can't figure out why.
Julian Seward [Sat, 19 Aug 2006 18:31:53 +0000 (18:31 +0000)]
Comparing a reg with itself produces a result which doesn't depend on
the contents of the reg. Therefore remove the false dependency, which
has been known to cause memcheck to produce false errors for
xlc-compiled code.
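For instance (ppc example, purely illustrative):
   cmpw cr7,r3,r3     ; always compares equal, whatever r3 holds, so the
                      ; definedness of the result must not depend on r3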