Julian Seward [Sat, 25 Aug 2007 23:07:44 +0000 (23:07 +0000)]
Merge from CGTUNE branch:
r1769:
This commit provides a bunch of enhancements to the IR optimiser
(iropt) and to the various backend instruction selectors.
Unfortunately the changes are interrelated and cannot easily be
committed in pieces in any meaningful way. Between them and the
already-committed register allocation enhancements (r1765, r1767)
performance of Memcheck is improved by 0%-10%. Improvements are also
applicable to other tools to lesser extents.
Main changes are:
* Add new IR primops Iop_Left64/32/16/8 and Iop_CmpwNEZ64/32/16/8
which Memcheck uses to express some primitive operations on
definedness (V) bits:
Left(x) = set all bits to the left of the rightmost 1 bit to 1
CmpwNEZ(x) = if x == 0 then 0 else 0xFF...FF
Left and CmpwNEZ are detailed in the Usenix 2005 paper (in which
CmpwNEZ is called PCast). The new primops expose opportunities for
IR optimisation at tree-build time. Prior to this change Memcheck
expressed Left and CmpwNEZ in terms of lower level primitives
(logical or, negation, compares, various casts) which was simpler
but hindered further optimisation.
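As a reference for their semantics, here is a minimal standalone C
sketch of the 64-bit forms (my rendering of the definitions above, not
VEX source):

  #include <stdint.h>

  /* Left64: keep the rightmost 1 bit of x and set every bit to its
     left; in two's complement this is just x | -x. */
  static uint64_t left64 ( uint64_t x )
  {
     return x | (0ULL - x);
  }

  /* CmpwNEZ64: all-zeroes if x is zero, all-ones otherwise. */
  static uint64_t cmpwNEZ64 ( uint64_t x )
  {
     return x == 0 ? 0 : ~0ULL;
  }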
* Enhance the IR optimiser's tree builder so it can rewrite trees
as they are constructed, according to useful identities, for example:
CmpwNEZ64( Or64 ( CmpwNEZ64(x), y ) ) --> CmpwNEZ64( Or64( x, y ) )
which gets rid of a CmpwNEZ64 operation - a win as they are relatively
expensive. See functions fold_IRExpr_Binop and fold_IRExpr_Unop.
Allowing the tree builder to rewrite trees also makes it possible to
have a single implementation of certain transformation rules which
were previously duplicated in the x86, amd64 and ppc instruction
selectors. For example
32to1(1Uto32(x)) --> x
This simplifies the instruction selectors and gives a central place
to put such IR-level transformations, which is a Good Thing.
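For instance, the 32to1 rule might look like this inside
fold_IRExpr_Unop (an illustrative sketch, assuming an is_Unop helper;
not the exact VEX source):

  static IRExpr* fold_IRExpr_Unop ( IROp op, IRExpr* aa )
  {
     switch (op) {
        case Iop_32to1:
           /* 32to1(1Uto32(x)) --> x */
           if (is_Unop(aa, Iop_1Uto32))
              return aa->Iex.Unop.arg;
           break;
        default:
           break;
     }
     return NULL; /* no rule applies; caller builds the node normally */
  }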
* Various minor refinements to the instruction selectors:
- ppc64 generates 32Sto64 as 1 instruction instead of 2
- x86 can now generate movsbl (example below)
- x86 handles 64-bit integer Mux0X better for cases typically
arising from Memchecking of FP code
- misc other patterns handled better
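For example (illustrative, not necessarily the exact case the commit
targets): an 8Sto32 widening that previously needed a shift-left/
shift-right-arithmetic pair can now be the single instruction
  movsbl %al, %eax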
Overall these changes are a straight win - vex generates less code,
and does so a bit faster since its register allocator has to chew
through fewer instructions. The main risk is that of correctness:
making Left and CmpwNEZ explicit, and adding rewrite rules for them,
is a substantial change in the way Memcheck deals with undefined value
tracking, and I am concerned to ensure that the changes do not cause
false negatives. I _think_ it's all correct so far.
r1770:
Get rid of Iop_Neg64/32/16/8 as they are no longer used by Memcheck,
and the uses generated by the front ends are so infrequent that
generating the equivalent Sub(0, ..) is good enough. This gets rid of
quite a few lines of code. Add isel cases for Sub(0, ..) patterns so
that the x86/amd64 backends still generate negl/negq where possible.
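The added pattern is roughly of this shape for the 32-bit case (a
sketch in the style of host_x86_isel.c; helper names and details may
differ from the real code):

  /* Sub32(0, x) --> negl */
  if (e->Iex.Binop.op == Iop_Sub32
      && e->Iex.Binop.arg1->tag == Iex_Const
      && e->Iex.Binop.arg1->Iex.Const.con->tag == Ico_U32
      && e->Iex.Binop.arg1->Iex.Const.con->Ico.U32 == 0) {
     HReg reg = iselIntExpr_R(env, e->Iex.Binop.arg2);
     HReg dst = newVRegI(env);
     addInstr(env, mk_iMOVsd_RR(reg, dst));         /* copy, then   */
     addInstr(env, X86Instr_Unary32(Xun_NEG, dst)); /* negate: negl */
     return dst;
  }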
r1771:
Handle Left64. Fixes failure on none/tests/x86/insn_sse2.
Julian Seward [Sat, 25 Aug 2007 21:29:03 +0000 (21:29 +0000)]
Merge, from CGTUNE branch:
r1768:
Cosmetic (non-functional) changes associated with r1767.
r1767:
Add a second spill-code-avoidance optimisation, which could be called
'directReload' for lack of a better name.
If an instruction reads exactly one vreg which is currently in a spill
slot, and this is the last use of that vreg, see if the instruction can be
converted into one that reads directly from the spill slot. This is
clearly only possible for x86 and amd64 targets, since ppc is a
load-store architecture. So, for example,
orl %vreg, %dst
where %vreg is in a spill slot, and this is its last use, would
previously be converted to
movl $spill-offset(%ebp), %tmp
orl %tmp, %dst
whereas now it becomes
orl $spill-offset(%ebp), %dst
This not only avoids an instruction, it eliminates the need for a
reload temporary (%tmp in this example) and so potentially further
reduces spilling.
Implementation is in two parts: an architecture independent part, in
reg_alloc2.c, which finds candidate instructions, and a host dependent
function (directReload_ARCH) for each arch supporting the
optimisation. The directReload_ function does the instruction form
conversion, when possible. Currently only x86 hosts are supported.
As a side effect, change the form of the X86_Test32 instruction from
reg-only to reg/mem so it can participate in such transformations.
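The hook's shape is roughly (a sketch; the exact signature in
host_x86_defs.h may differ):

  /* If 'i' reads exactly one vreg, that vreg is 'vreg', and this is
     its last use, return a variant of 'i' which reads directly from
     vreg's spill slot, at offset 'spill_off' from %ebp; return NULL
     if no such form of 'i' exists. */
  X86Instr* directReload_X86 ( X86Instr* i, HReg vreg, Short spill_off );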
This gives a code size reduction of 0.6% for perf/bz2 on x86 memcheck,
but tends to be more effective for long blocks of x86 FP code.
Julian Seward [Sat, 25 Aug 2007 21:11:33 +0000 (21:11 +0000)]
Merge, from CGTUNE branch:
r1765:
During register allocation, keep track of which (real) registers have
the same value as their associated spill slot. Then, if a register
needs to be freed up for some reason, and that register has the same
value as its spill slot, there is no need to produce a spill store.
This substantially reduces the number of spill store instructions
created. Overall gives a 1.9% generated code size reduction for
perf/bz2 running on x86.
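Sketched in pseudo-C (field and helper names are illustrative, not
necessarily those used in reg_alloc2.c):

  /* Each real register carries a flag saying whether its current
     value is identical to that of its spill slot.  The flag is set by
     a spill store or a reload, and cleared whenever the register is
     written.  Freeing the register then only needs a store when the
     flag is clear. */
  if (rreg_state[k].disp == Bound) {
     if (!rreg_state[k].eq_spill_slot)
        emit_spill_store(rreg_state[k].vreg, k);
     rreg_state[k].disp = Free;
  }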
r1766:
Followup to r1765: fix some comments, and rearrange fields in struct
RRegState so as to fit it into 16 bytes.
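A plausible 16-byte arrangement on a 32-bit host (a guess for
illustration; only the spill-slot flag's role is described above):

  typedef struct {
     HReg rreg;                           /* 4: the real register   */
     HReg vreg;                           /* 4: bound vreg, if any  */
     enum { Free, Unavail, Bound } disp;  /* 4                      */
     Bool is_spill_cand;                  /* 1                      */
     Bool eq_spill_slot;                  /* 1 (+2 padding) = 16    */
  } RRegState;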
Julian Seward [Tue, 1 May 2007 13:53:01 +0000 (13:53 +0000)]
Stop gcc-4.2 producing hundreds of complaints of the form "warning:
cast from pointer to integer of different size" when compiling on a
64-bit target. gcc-4.2 is correct to complain. An interesting
question is why no previous gcc warned about this.
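The usual fix is to route such casts through an integer type as wide
as a pointer, e.g. (an illustrative before/after, not quoted from the
diff):

  #include "libvex_basictypes.h"   /* UInt, HWord */

  static UInt lo32_of_ptr ( void* p )
  {
     /* (UInt)p warns on 64-bit hosts: the pointer is 64 bits, UInt is
        32.  Going via the pointer-sized HWord makes the truncation
        explicit and silences the warning. */
     return (UInt)(HWord)p;
  }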
Julian Seward [Sat, 31 Mar 2007 14:30:12 +0000 (14:30 +0000)]
Teach the x86 back end how to generate 'lea' instructions, and generate
them in a couple of places which are important. This reduces the
amount of generated code for memcheck and none by about 1%, and (in
very unscientific tests on perf/bz2) speeds memcheck up by about 1%.
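An illustrative case (not necessarily one of the important places the
commit targets): computing %ebx + 4*%eax + 4, which otherwise needs a
move, a shift and two adds, becomes the single instruction
  leal 4(%ebx,%eax,4), %edx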
Julian Seward [Sun, 25 Mar 2007 04:14:58 +0000 (04:14 +0000)]
x86 back end: use 80-bit loads/stores for floating point spills rather
than 64-bit ones, to reduce accuracy loss. To support this, in
reg-alloc, allocate 2 64-bit spill slots for each HRcFlt64 vreg
instead of just 1.
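A spill/reload pair thus now looks schematically like (in the style of
the earlier examples; fstpt/fldt move a 10-byte image, hence the two
slots):
  fstpt $spill-offset(%ebp)
  fldt  $spill-offset(%ebp)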
Julian Seward [Tue, 20 Mar 2007 14:18:45 +0000 (14:18 +0000)]
x86 front end: synthesise SIGILL in the normal way for some obscure
invalid instruction cases, rather than asserting, as happened in
#143079 and #142279. amd64 equivalents to follow.
Julian Seward [Fri, 9 Mar 2007 18:07:00 +0000 (18:07 +0000)]
When generating 64-bit code, ensure that any addresses used in 4 or 8
byte loads or stores of the form reg+imm have the lowest 2 bits of imm
set to zero, so that they can safely be used in ld/ldu/lwa/std/stdu
instructions. This boils down to doing an extra check in
iselWordExpr_AMode and avoiding the reg+imm case in cases where the
amode might end up in any of the abovementioned instructions.
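The check amounts to (a sketch; the real change threads the access
size/type through iselWordExpr_AMode):

  /* A reg+imm amode can appear in a DS-form instruction (ld/ldu/lwa/
     std/stdu) only if imm fits in 16 bits signed and has its low two
     bits zero, since DS-form encodes imm with those bits implicit. */
  static Bool ok_for_DSform ( Long imm )
  {
     return imm == (Long)(Short)imm && (imm & 3) == 0;
  }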
Julian Seward [Sat, 27 Jan 2007 00:46:28 +0000 (00:46 +0000)]
Fill in missing cases in eqIRConst. This stops iropt's CSE pass from
asserting in the presence of V128 immediates, which is a regression
in valgrind 3.2.2.
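The V128 constant is represented as a 16-bit mask, one bit per 1-byte
lane of the vector, so the missing comparison is essentially just (a
sketch):

  case Ico_V128:
     return toBool(c1->Ico.V128 == c2->Ico.V128);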
Julian Seward [Wed, 10 Jan 2007 04:59:33 +0000 (04:59 +0000)]
Implement FXSAVE on amd64. Mysteriously my Athlon64 does not seem to
write all the fields that the AMD documentation says it should: it
skips FOP, RIP and RDP, so vex's implementation writes zeroes there.
Julian Seward [Sun, 24 Dec 2006 02:20:24 +0000 (02:20 +0000)]
A large but non-functional commit: as suggested by Nick, rename some
IR types, structure fields and functions to make IR a bit easier to
understand. Specifically:
dopyIR* -> deepCopyIR*
sopyIR* -> shallowCopyIR*
The presence of a .Tmp union in both IRExpr and IRStmt is
confusing. It has been renamed to RdTmp in IRExpr, reflecting
the fact that here we are getting the value of an IRTemp, and to
WrTmp in IRStmt, reflecting the fact that here we are assigning
to an IRTemp.
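So reading and writing a temporary now look like this (using the
renamed constructor functions; t1 and t2 are previously allocated
IRTemps):

  IRExpr* rd = IRExpr_RdTmp(t1);      /* get the value of t1 */
  IRStmt* wr = IRStmt_WrTmp(t2, rd);  /* t2 = that value     */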
IRBB (IR Basic Block) is renamed to IRSB (IR SuperBlock),
reflecting the reality that Vex does not really operate in terms
of basic blocks, but in terms of superblocks - single entry,
multiple exit sequences.
IRArray is renamed to IRRegArray, to make it clearer it refers
to arrays of guest registers and not arrays in memory.
VexMiscInfo is renamed to VexAbiInfo, since that's what it is
-- relevant facts about the ABI (calling conventions, etc) for
both the guest and host platforms.
Julian Seward [Fri, 1 Dec 2006 02:59:17 +0000 (02:59 +0000)]
Change a stupid algorithm that deals with real register live
ranges into a less stupid one. Prior to this change, the complexity
of reg-alloc included an expensive term
O(#instrs in code sequence x #real-register live ranges in code sequence)
This commit changes that term to essentially
O(#instrs in code sequence) + O(time to sort real-reg-L-R array)
On amd64 this nearly halves the cost of register allocation and means
Valgrind performs better in translation-intensive situations (a.k.a.
starting programs). E.g., firefox start/exit falls from 119 to 113
seconds. The effect will be larger on ppc32/64 as there are more real
registers and hence real-reg live ranges to consider, and will be
smaller on x86 for the same reason.
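The commit message doesn't spell out the mechanism, but the standard
way to get such a bound is a sorted sweep, along these lines (a sketch
with made-up names, assuming that technique):

  /* Sort the live ranges by start point once, then advance a cursor
     during the single pass over the instructions, instead of scanning
     the whole live-range array at every instruction. */
  sortByStartPoint(rreg_lrs, n_rreg_lrs);
  Int next = 0;
  for (Int ii = 0; ii < n_instrs; ii++) {
     while (next < n_rreg_lrs && rreg_lrs[next].live_after == ii) {
        markRRegUnavail(rreg_lrs[next].rreg);
        next++;
     }
     /* ... allocate registers for instruction ii ... */
  }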
The actual code the JIT produces should be unchanged. This commit
merely modifies how the register allocator handles one of its
important data structures.
Julian Seward [Thu, 19 Oct 2006 03:01:09 +0000 (03:01 +0000)]
When doing rlwinm in 64-bit mode, bind the intermediate 32-bit result
to a temporary so it is only computed once. What's there currently
causes it to be computed twice.
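Schematically, the fix is the usual bind-to-temp idiom in the toIR
code (expr32 here stands for the already-built 32-bit rotate-and-mask
expression):

  IRTemp r32 = newTemp(Ity_I32);
  assign(r32, expr32);   /* computed once */
  /* ... both subsequent uses say mkexpr(r32) instead of duplicating
     the expression tree ... */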
Julian Seward [Tue, 17 Oct 2006 00:28:22 +0000 (00:28 +0000)]
Merge r1663-r1666:
- AIX5 build changes
- genoffsets.c: print the offsets of a few more ppc registers
- Get rid of a bunch of ad-hoc hacks which hardwire in certain
assumptions about guest and host ABIs. Instead pass that info
in a VexMiscInfo structure. This cleans up various grotty bits.
- Add to ppc32 guest state, redirection-stack stuff already present
in ppc64 guest state. This is to enable function redirection/
wrapping in the presence of TOC pointers in 32-bit mode.
- Add to both ppc32 and ppc64 guest states, a new pseudo-register
LR_AT_SC. This holds the link register value at the most recent
'sc', so that AIX can back up to restart a syscall if needed.
- Add to both ppc32 and ppc64 guest states, a SPRG3 register.
- Use VexMiscInfo to handle 'sc' on AIX differently from Linux:
on AIX, 'sc' continues at the location stated in the link
register, not at the next insn.
Add support for amd64 'fprem' (fixes bug 132918). This isn't exactly
right; the C3/2/1/0 FPU flags sometimes don't get set the same as
natively, and I can't figure out why.
Julian Seward [Sat, 19 Aug 2006 18:31:53 +0000 (18:31 +0000)]
Comparing a reg with itself produces a result which doesn't depend on
the contents of the reg. Therefore remove the false dependency, which
has been known to cause memcheck to produce false errors for
xlc-compiled code.
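In rule form, something like (illustrative, using ppc's CmpORD result
encoding in which 8/4/2 stand for LT/GT/EQ):
  CmpORD32S(x, x)  -->  0x2:I32
so the result no longer mentions the register's contents at all.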
Julian Seward [Sun, 21 May 2006 01:02:31 +0000 (01:02 +0000)]
A couple of IR simplification hacks for the amd64 front end, so as to
avoid false errors from memcheck. Analogous to some of the recent
bunch of commits to x86 front end.
Julian Seward [Sun, 14 May 2006 18:46:55 +0000 (18:46 +0000)]
Add an IR folding rule to convert Add32(x,x) into Shl32(x,1). This
fixes #118466 and also gets rid of a bunch of false positives for
KDE 3.5.2 built by gcc-4.0.2 on x86, of the form shown below. The win
for Memcheck is that it can instrument a constant shift exactly,
whereas its instrumentation of an add has to approximate carry
propagation through the definedness bits, which is what produced the
spurious reports.
Use of uninitialised value of size 4
at 0x4BFC342: QIconSet::pixmap(QIconSet::Size, QIconSet::Mode,
QIconSet::State) const (qiconset.cpp:530)
by 0x4555BE7: KToolBarButton::drawButton(QPainter*)
(ktoolbarbutton.cpp:536)
by 0x4CB8A0A: QButton::paintEvent(QPaintEvent*) (qbutton.cpp:887)
Julian Seward [Sat, 13 May 2006 23:08:06 +0000 (23:08 +0000)]
Add specialisation rules to simplify the IR for 'testl .. ; js ..',
'testw .. ; js ..' and 'testb .. ; js ..'. This gets rid of a bunch of
false errors in Memcheck of the form
==2398== Conditional jump or move depends on uninitialised value(s)
==2398== at 0x6C51B61: KHTMLPart::clear() (khtml_part.cpp:1370)
==2398== by 0x6C61A72: KHTMLPart::begin(KURL const&, int, int)
(khtml_part.cpp:1881)
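Each of these comes down to reading the sign bit of the AND result
directly. In the same '-->' rule notation used above, the 32-bit case
is essentially (a sketch):
  js after testl  -->  Shr32( And32(x,y), 31 )
which Memcheck can then track exactly, bit by bit.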