Fix a Makefile issue that I think caused automated testing to fail on
'alvis' last night. I don't know why it worked on the other machines; it
must be an automake version thing.
Fix 64-bit Massif breakage, caused by problems with integer arithmetic on
values of different signs and sizes that only a C language lawyer would
spot.
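
For illustration only, here is a minimal sketch of the general class of problem, not the actual Massif code. The function names and operand types are hypothetical, and it assumes a platform where int is 32 bits. Mixing a signed and an unsigned 32-bit operand makes the arithmetic wrap modulo 2^32 before the result is widened to 64 bits. With a 32-bit result type the wrapped value would typically convert back to -1, which is how such code can look correct on 32-bit builds.

```c
#include <stdint.h>

/* Hypothetical example: subtract an unsigned 32-bit size from a signed
   32-bit delta and return the result as a signed 64-bit value. */
int64_t delta_buggy(int32_t delta, uint32_t size)
{
    /* Where int is 32 bits, delta is converted to uint32_t, so the
       subtraction wraps modulo 2^32 and only then is widened. */
    return delta - size;
}

int64_t delta_fixed(int32_t delta, uint32_t size)
{
    /* Widen both operands to a signed 64-bit type before subtracting. */
    return (int64_t)delta - (int64_t)size;
}
```

Under that assumption, delta_buggy(0, 1) yields 4294967295 whereas delta_fixed(0, 1) yields -1.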
Merged the MASSIF2 branch to the trunk. Main changes:
- ms_main.c: completely overhauled.
- massif/tests/*: lots of them now.
- massif/perf/: added.
- massif/hp2ps: removed. No longer used.
- vg_regtest: renamed the previously unused "posttest" notion to "post".
It is now used for checking ms_print's output.
Although the code has changed dramatically, as has the form of the tool's
output, the information presented in the output is basically the same,
and is now (hopefully) much more useful. So the tool name is
unchanged.
callgrind_control: Fix behavior with callgrind runs of another user
callgrind_control uses files /tmp/callgrind.info.* to be able to
locate running callgrind processes. These files can be read only by
the user who started callgrind. The callgrind_control script
did not check for "permission denied" on opening these files, which
resulted in some unexpected errors. Now the script checks whether
the "open" was successful, and if not, skips the corresponding callgrind
process.
callgrind: Use directory in debug info when available
Prepend the file name of a source file with the directory
if that is available. This not only gets rid of problems with the
same file name used in different paths of a project, but lets
the annotation work out of the box without having to specify any
source directory.
Works both with callgrind_annotate and KCachegrind without any
changes there.
Inspired by Nick's change to cachegrind doing the same thing
in r6839 (and gets rid of a FIXME in the source)
Split the OSet interface into two parts: "OSetGen_", which is the existing
interface and provides full power; and "OSetWord_", which is an
easier-to-use interface for when you just want to store words.
ppc32-linux signal handling: don't place the sigframe return stub on
the stack; instead use a stub in m_trampoline.S. This makes it
possible to deliver signals on non-executable stacks, and makes the
behaviour consistent with x86-linux and amd64-linux.
Julian Seward [Wed, 29 Aug 2007 09:11:35 +0000 (09:11 +0000)]
Valgrind-side changes to track vx1786 (which was: Support x86 $int
0x40 .. 0x43 instructions on Linux. Apparently these generate a
segfault and then restart the instruction.)
Julian Seward [Tue, 28 Aug 2007 06:06:27 +0000 (06:06 +0000)]
Merge, from CGTUNE branch, r1774:
Vex-side changes to allow tools to provide a final_tidy function which
they can use to mess with the final post-tree-built IR before it is
handed off to instruction selection.
Julian Seward [Tue, 28 Aug 2007 06:05:20 +0000 (06:05 +0000)]
Merge, from CGTUNE branch, a cleaned up version of r6742:
Another optimisation: allow tools to provide a final_tidy function
which they can use to mess with the final post-tree-built IR before it
is handed off to instruction selection.
In memcheck, use this to remove redundant calls to
MC_(helperc_value_check0_fail) et al. Gives a 6% reduction in code
size for Memcheck on x86 and a smaller (3% ?) speedup.
Julian Seward [Mon, 27 Aug 2007 10:46:39 +0000 (10:46 +0000)]
This module supplies various replacement functions, amongst them a
replacement for index/strchr in ld.so. Unfortunately the replacement
functionality was actually rindex/strrchr and amazingly it has taken
about 2.5 years for anyone to notice.
This fixes the x86-linux case; ppc32-linux and ppc64-linux fixes to
follow.
Julian Seward [Sat, 25 Aug 2007 23:21:08 +0000 (23:21 +0000)]
Merge from CGTUNE branch, code generation improvements for amd64:
r1772:
When generating code for helper calls, be more aggressive about
computing values directly into argument registers, thereby avoiding
some reg-reg shuffling. This reduces the amount of code (on amd64)
generated by Cachegrind by about 6% and has zero or marginal benefit
for other tools.
r1773:
Emit 64-bit branch targets using 32-bit short forms when possible.
Since (with V's default amd64 load address of 0x38000000) this is
usually possible, it saves about 7% in code size for Memcheck and even
more for Cachegrind.
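
A sketch of the kind of check r1773 implies, assuming the short form is an instruction whose 32-bit immediate is zero-extended to 64 bits (as a 32-bit movl into a register is on amd64). The helper name is hypothetical and is not taken from the vex sources.

```c
#include <stdint.h>

/* Nonzero if addr survives truncation to 32 bits followed by zero
   extension, i.e. if its top 32 bits are all zero. */
int fits_in_zx32(uint64_t addr)
{
    return addr == (uint64_t)(uint32_t)addr;
}
```

Targets near 0x38000000 pass this test; anything at or above 2^32 still needs the full 64-bit form.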
Julian Seward [Sat, 25 Aug 2007 23:07:44 +0000 (23:07 +0000)]
Merge from CGTUNE branch:
r1769:
This commit provides a bunch of enhancements to the IR optimiser
(iropt) and to the various backend instruction selectors.
Unfortunately the changes are interrelated and cannot easily be
committed in pieces in any meaningful way. Between them and the
already-committed register allocation enhancements (r1765, r1767)
performance of Memcheck is improved by 0%-10%. Improvements are also
applicable to other tools to lesser extents.
Main changes are:
* Add new IR primops Iop_Left64/32/16/8 and Iop_CmpwNEZ64/32/16/8
which Memcheck uses to express some primitive operations on
definedness (V) bits:
Left(x) = set all bits to the left of the rightmost 1 bit to 1
CmpwNEZ(x) = if x == 0 then 0 else 0xFF...FF
Left and CmpwNEZ are detailed in the Usenix 2005 paper (in which
CmpwNEZ is called PCast). The new primops expose opportunities for
IR optimisation at tree-build time. Prior to this change Memcheck
expressed Left and CmpwNEZ in terms of lower level primitives
(logical or, negation, compares, various casts) which was simpler
but hindered further optimisation.
* Enhance the IR optimiser's tree builder so it can rewrite trees
as they are constructed, according to useful identities, for example:
CmpwNEZ64( Or64 ( CmpwNEZ64(x), y ) ) --> CmpwNEZ64( Or64( x, y ) )
which gets rid of a CmpwNEZ64 operation - a win as they are relatively
expensive. See functions fold_IRExpr_Binop and fold_IRExpr_Unop.
Allowing the tree builder to rewrite trees also makes it possible to
have a single implementation of certain transformation rules which
were previously duplicated in the x86, amd64 and ppc instruction
selectors. For example
32to1(1Uto32(x)) --> x
This simplifies the instruction selectors and gives a central place
to put such IR-level transformations, which is a Good Thing.
* Various minor refinements to the instruction selectors:
- ppc64 generates 32Sto64 into 1 instruction instead of 2
- x86 can now generate movsbl
- x86 handles 64-bit integer Mux0X better for cases typically
arising from Memchecking of FP code
- misc other patterns handled better
Overall these changes are a straight win - vex generates less code,
and does so a bit faster since its register allocator has to chew
through fewer instructions. The main risk is that of correctness:
making Left and CmpwNEZ explicit, and adding rewrite rules for them,
is a substantial change in the way Memcheck deals with undefined value
tracking, and I am concerned to ensure that the changes do not cause
false negatives. I _think_ it's all correct so far.
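
As a sanity aid for r1769, here is a sketch of the 64-bit forms of the two operations on plain integers, with the first rewrite rule written as a pair of functions that should always agree. It assumes Left(x) is computed as x | -x, which also leaves the rightmost 1 bit itself set. The function names are illustrative and are not the vex identifiers.

```c
#include <stdint.h>

/* Left(x): every bit at or above the rightmost 1 bit becomes 1.
   Assumed here to be x | -x. */
uint64_t left64(uint64_t x)
{
    return x | (0 - x);
}

/* CmpwNEZ(x): all zeroes if x is zero, otherwise all ones. */
uint64_t cmpwnez64(uint64_t x)
{
    return x == 0 ? 0 : ~(uint64_t)0;
}

/* Left-hand side of the rule: CmpwNEZ64( Or64( CmpwNEZ64(x), y ) ) */
uint64_t rule_before(uint64_t x, uint64_t y)
{
    return cmpwnez64(cmpwnez64(x) | y);
}

/* Right-hand side of the rule: CmpwNEZ64( Or64( x, y ) ) */
uint64_t rule_after(uint64_t x, uint64_t y)
{
    return cmpwnez64(x | y);
}
```

The rule holds because both sides are all ones exactly when x or y is nonzero, so the inner CmpwNEZ64 contributes nothing.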
r1770:
Get rid of Iop_Neg64/32/16/8 as they are no longer used by Memcheck,
and any uses as generated by the front ends are so infrequent that
generating the equivalent Sub(0, ..) is good enough. This gets rid of
quite a few lines of code. Add isel cases for Sub(0, ..) patterns so
that the x86/amd64 backends still generate negl/negq where possible.
r1771:
Handle Left64. Fixes failure on none/tests/x86/insn_sse2.
Julian Seward [Sat, 25 Aug 2007 21:29:03 +0000 (21:29 +0000)]
Merge, from CGTUNE branch:
r1768:
Cosmetic (non-functional) changes associated with r1767.
r1767:
Add a second spill-code-avoidance optimisation, which could be called
'directReload' for lack of a better name.
If an instruction reads exactly one vreg which is currently in a spill
slot, and this is the last use of that vreg, see if the instruction can be
converted into one that reads directly from the spill slot. This is
clearly only possible for x86 and amd64 targets, since ppc is a
load-store architecture. So, for example,
orl %vreg, %dst
where %vreg is in a spill slot, and this is its last use, would
previously be converted to
movl $spill-offset(%ebp), %tmp
orl %tmp, %dst
whereas now it becomes
orl $spill-offset(%ebp), %dst
This not only avoids an instruction, it eliminates the need for a
reload temporary (%tmp in this example) and so potentially further
reduces spilling.
Implementation is in two parts: an architecture independent part, in
reg_alloc2.c, which finds candidate instructions, and a host dependent
function (directReload_ARCH) for each arch supporting the
optimisation. The directReload_ function does the instruction form
conversion, when possible. Currently only x86 hosts are supported.
As a side effect, change the form of the X86_Test32 instruction from
reg-only to reg/mem so it can participate in such transformations.
This gives a code size reduction of 0.6% for perf/bz2 on x86 memcheck,
but tends to be more effective for long blocks of x86 FP code.
Julian Seward [Sat, 25 Aug 2007 21:11:33 +0000 (21:11 +0000)]
Merge, from CGTUNE branch:
r1765:
During register allocation, keep track of which (real) registers have
the same value as their associated spill slot. Then, if a register
needs to be freed up for some reason, and that register has the same
value as its spill slot, there is no need to produce a spill store.
This substantially reduces the number of spill store instructions
created. Overall gives a 1.9% generated code size reduction for
perf/bz2 running on x86.
r1766:
Followup to r1765: fix some comments, and rearrange fields in struct
RRegState so as to fit it into 16 bytes.
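
The idea in r1765 can be sketched roughly as follows. This is a simplified, hypothetical model: the struct and function names are invented for the example and it is not the real RRegState or allocator code.

```c
#include <stdbool.h>

/* Hypothetical per-real-register state. */
typedef struct {
    int  vreg;           /* virtual register held, or -1 if free */
    bool eq_spill_slot;  /* register and vreg's spill slot hold the
                            same value */
} RegState;

/* The register has just been loaded from vreg's spill slot. */
void note_reload(RegState *r, int vreg)
{
    r->vreg = vreg;
    r->eq_spill_slot = true;
}

/* An instruction has written the register, so the slot is stale. */
void note_write(RegState *r)
{
    r->eq_spill_slot = false;
}

/* Free the register. Returns the number of spill stores needed
   (0 or 1); a real allocator would emit the store at this point. */
int evict(RegState *r)
{
    int stores = 0;
    if (r->vreg != -1 && !r->eq_spill_slot)
        stores = 1;
    r->vreg = -1;
    r->eq_spill_slot = false;
    return stores;
}
```

A register that was reloaded and never written afterwards is evicted with no store at all, which is where the saving comes from.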
Julian Seward [Sat, 25 Aug 2007 07:19:08 +0000 (07:19 +0000)]
Changes to m_hashtable:
Allow hashtables to dynamically resize (patch from Christoph
Bartoschek). Results in the following interface changes:
* HT_construct: no need to supply an initial table size.
Instead, supply a text string used to "name" the table, so
that debugging messages ("resizing the table") can say which
one they are resizing.
* Remove VG_(HT_get_node). This exposes the chain structure to
callers (via the next_ptr parameter), which is a problem since
callers could get some info about the chain structure which then
changes when the table is resized. Fortunately it is not used.
* Remove VG_(HT_first_match) and VG_(HT_apply_to_all_nodes) as
they are unused.
* Make the iteration mechanism more paranoid, so any adding or
deleting of nodes part way through an iteration causes VG_(HT_next)
to assert.
* Fix the comment on VG_(HT_to_array) so it no longer speaks
specifically about MC's leak detector.
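
For readers unfamiliar with the pattern, here is a self-contained sketch of a chained hash table that grows on demand and asserts if it is modified part way through an iteration. All names, the initial size and the growth policy are assumptions made for the example; this is not the m_hashtable code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Node { struct Node *next; unsigned long key; } Node;

typedef struct {
    const char *name;      /* identifies the table in debug messages */
    size_t n_chains, n_elems;
    Node **chains;
    bool   iter_ok;        /* cleared by any add, checked by ht_next */
    size_t iter_chain;     /* next chain to visit */
    Node  *iter_node;      /* next node in the current chain */
} Table;

Table *ht_construct(const char *name)
{
    Table *t = calloc(1, sizeof *t);
    assert(t);
    t->name = name;
    t->n_chains = 4;       /* small initial size, chosen arbitrarily */
    t->chains = calloc(t->n_chains, sizeof *t->chains);
    assert(t->chains);
    return t;
}

/* Double the number of chains and rehash every node. */
static void ht_resize(Table *t)
{
    size_t new_n = t->n_chains * 2;
    Node **nc = calloc(new_n, sizeof *nc);
    assert(nc);
    for (size_t i = 0; i < t->n_chains; i++) {
        Node *n = t->chains[i];
        while (n) {
            Node *next = n->next;
            size_t j = n->key % new_n;
            n->next = nc[j];
            nc[j] = n;
            n = next;
        }
    }
    free(t->chains);
    t->chains = nc;
    t->n_chains = new_n;
}

void ht_add(Table *t, Node *n)
{
    if (t->n_elems >= t->n_chains)
        ht_resize(t);      /* keeps elements per chain at or below 1 on average */
    size_t i = n->key % t->n_chains;
    n->next = t->chains[i];
    t->chains[i] = n;
    t->n_elems++;
    t->iter_ok = false;    /* invalidate any iteration in progress */
}

void ht_reset_iter(Table *t)
{
    t->iter_chain = 0;
    t->iter_node = NULL;
    t->iter_ok = true;
}

/* Returns the next node, or NULL when the iteration is finished. */
Node *ht_next(Table *t)
{
    assert(t->iter_ok);    /* fires if the table changed mid-iteration */
    if (t->iter_node) {
        Node *n = t->iter_node;
        t->iter_node = n->next;
        return n;
    }
    while (t->iter_chain < t->n_chains) {
        Node *n = t->chains[t->iter_chain++];
        if (n) {
            t->iter_node = n->next;
            return n;
        }
    }
    return NULL;
}
```

Because a resize moves nodes between chains, any caller holding a chain pointer across an add would see stale structure, which is the same hazard that motivated removing VG_(HT_get_node).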
Julian Seward [Thu, 23 Aug 2007 10:22:44 +0000 (10:22 +0000)]
The drastic increase in the number of per-arena freelists in r6771
exposes a performance problem with doing m_mallocfree.c sanity checks
(at --sanity-level=3, at least), caused by slowness in
listNo_to_pszB_min. This commit fixes the problem by caching the
results of queries to listNo_to_pszB_min.
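
The caching approach can be sketched like this. The slow function below is a hypothetical stand-in, not the real listNo_to_pszB_min, and the table size of 112 is the per-arena freelist count mentioned elsewhere in this log.

```c
#include <stddef.h>

#define N_LISTS 112

size_t slow_calls = 0;   /* counts how often the slow path runs */

/* Hypothetical stand-in for an expensive list-number to
   minimum-payload-size computation. */
static size_t slow_min_size(unsigned list_no)
{
    size_t sz = 0;
    slow_calls++;
    for (unsigned i = 0; i < list_no; i++)
        sz += 8;
    return sz;
}

/* Same result as slow_min_size, but each list number below N_LISTS
   is computed at most once. */
size_t cached_min_size(unsigned list_no)
{
    static size_t cache[N_LISTS];
    static unsigned char valid[N_LISTS];

    if (list_no >= N_LISTS)
        return slow_min_size(list_no);   /* out of range: no caching */
    if (!valid[list_no]) {
        cache[list_no] = slow_min_size(list_no);
        valid[list_no] = 1;
    }
    return cache[list_no];
}
```

This only works because the mapping is a pure function of the list number, so a cached answer can never go stale.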
Julian Seward [Tue, 21 Aug 2007 10:55:26 +0000 (10:55 +0000)]
Previously, each Arena had a linked list of Superblocks, which can
make VG_(arena_free) expensive if many superblocks have to be checked
before the right one is found. This change gives the arena a
dynamically expanding sorted array of superblocks, so that finding the
superblock containing an about-to-be-freed block (findSb) is now
O(log2 n) rather than linear in the number of superblocks in the
arena. Patch from Christoph Bartoschek.
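
A minimal sketch of the lookup, assuming the superblocks are kept sorted by start address and do not overlap. The struct layout and function name are invented for the example and are not taken from m_mallocfree.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical superblock descriptor. */
typedef struct {
    uintptr_t start;    /* first byte of the superblock */
    size_t    n_bytes;  /* its length in bytes */
} Superblock;

/* Binary search over an array sorted by start address. Returns the
   index of the superblock containing addr, or -1 if there is none. */
long find_sb(const Superblock *sbs, size_t n, uintptr_t addr)
{
    size_t lo = 0, hi = n;              /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < sbs[mid].start)
            hi = mid;                   /* addr lies below this block */
        else if (addr - sbs[mid].start >= sbs[mid].n_bytes)
            lo = mid + 1;               /* addr lies above this block */
        else
            return (long)mid;           /* addr is inside this block */
    }
    return -1;
}
```

The price is that inserting a new superblock must keep the array sorted, but that happens far less often than freeing a block.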
Julian Seward [Mon, 20 Aug 2007 22:57:56 +0000 (22:57 +0000)]
Some improvements for malloc/free intensive programs, inspired by
performance studies by Christoph Bartoschek:
* Increase the number of freelists per arena from 18 to 112, so as
to (drastically) cut down on the amount of freelist searching that
happens.
* Increase the size of the client and tool arenas, so as to reduce
the cost of finding arenas during freeing. This is a kludge; a
better solution would be to use binary search on superblocks, as
Christoph's patches do.
Get rid of VG_(getcwd) and replace it with a pair of functions,
VG_(record_startup_wd) which records the working directory at startup,
and VG_(get_startup_wd) which later tells you what value was recorded.
This works because all uses of VG_(getcwd) serve only to record the
directory at process start anyway. The motivation is that AIX does
not support sys_getcwd directly, so it's easier for the launcher to
ship in the required value using an environment variable. On Linux
sys_getcwd is used as before.
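
The record-once, query-later pattern might look like the following in ordinary POSIX C. This is a sketch only: the function names, the buffer size and the env_name parameter (standing in for whatever variable a launcher would use) are assumptions, and the real VG_ functions do not use libc.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WD_MAX 4096

static char startup_wd[WD_MAX];
static bool startup_wd_valid = false;

/* Record the working directory once, at startup. If env_name is
   non-NULL and that variable is set, its value is used instead of
   asking the kernel. */
bool record_startup_wd(const char *env_name)
{
    const char *from_env = env_name ? getenv(env_name) : NULL;
    if (from_env != NULL) {
        if (strlen(from_env) >= WD_MAX)
            return false;
        strcpy(startup_wd, from_env);
    } else if (getcwd(startup_wd, WD_MAX) == NULL) {
        return false;
    }
    startup_wd_valid = true;
    return true;
}

/* Copy the recorded directory into buf. Fails if nothing has been
   recorded yet or if buf is too small. */
bool get_startup_wd(char *buf, size_t size)
{
    if (!startup_wd_valid || strlen(startup_wd) >= size)
        return false;
    strcpy(buf, startup_wd);
    return true;
}
```

Later calls to get_startup_wd keep returning the recorded value even if the process changes directory afterwards, which matches how the old callers used the result.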
Callgrind manual: rewriting start of section about avoiding cycles
This hopefully makes the whole issue with cycles easier to understand.
And no, this does not get rid of the description of cycles, carefully
crafted by Julian ;-)