From: Julian Seward <jseward@acm.org>
Date: Thu, 22 Nov 2007 01:21:56 +0000 (+0000)
Subject: Update documents in preparation for 3.3.0, and restructure them
X-Git-Tag: svn/VALGRIND_3_3_0~92
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=9101880b1f2b8eb5a228c7c7d7cc61e8a20e6c4a;p=thirdparty%2Fvalgrind.git

Update documents in preparation for 3.3.0, and restructure them
to move less relevant material out of the way.
The main changes are:

* Update date and version info

* Mention other tools in the quick-start guide

* Document --child-silent-after-fork

* Rearrange order of sections in the Valgrind Core chapter, to move
  advanced stuff (client requests) to the end, and compact stuff
  relevant to the majority of users towards the front

* Move MPI debugging stuff from the Core manual (a nonsensical place
  for it) to the Memcheck chapter

* Update the manual's introductory chapter a bit

* Connect up new tech docs summary page, and disconnect old and
  very out of date valgrind/memcheck tech docs

* Add section tags to the Cachegrind manual, to stop xsltproc
  complaining about their absence



git-svn-id: svn://svn.valgrind.org/valgrind/trunk@7199
---

diff --git a/ACKNOWLEDGEMENTS b/ACKNOWLEDGEMENTS
index 0026d9c3f4..81ffdcf8f6 100644
--- a/ACKNOWLEDGEMENTS
+++ b/ACKNOWLEDGEMENTS
@@ -6,8 +6,9 @@ dynamic-translation framework.
 
 Jeremy Fitzhardinge, jeremy@valgrind.org
 
-Jeremy wrote Helgrind and totally overhauled low-level syscall/signal
-and address space layout stuff, among many other improvements.
+Jeremy wrote Helgrind (in the 2.X line) and totally overhauled
+low-level syscall/signal and address space layout stuff, among many
+other improvements.
 
 Tom Hughes, tom@valgrind.org
 
diff --git a/AUTHORS b/AUTHORS
index eeb0549b72..3c68c2fa07 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -2,8 +2,9 @@
 Cerion Armour-Brown worked on PowerPC instruction set support using
 the Vex dynamic-translation framework.
 
-Jeremy Fitzhardinge wrote Helgrind and totally overhauled low-level
-syscall/signal and address space layout stuff, among many other things.
+Jeremy Fitzhardinge wrote Helgrind (in the 2.X line) and totally
+overhauled low-level syscall/signal and address space layout stuff,
+among many other things.
 
 Tom Hughes did a vast number of bug fixes, and helped out with support
 for more recent Linux/glibc versions.
diff --git a/cachegrind/docs/cg-manual.xml b/cachegrind/docs/cg-manual.xml
index 2a643ecc06..59b70e1d8c 100644
--- a/cachegrind/docs/cg-manual.xml
+++ b/cachegrind/docs/cg-manual.xml
@@ -937,7 +937,7 @@ way as for C/C++ programs.</para>
   
 
 
-<sect2>
+<sect2 id="cg-manual.annopts.warnings" xreflabel="Warnings">
 <title>Warnings</title>
 
 <para>There are a couple of situations in which
@@ -969,7 +969,8 @@ warnings.</para>
 
 
 
-<sect2>
+<sect2 id="cg-manual.annopts.things-to-watch-out-for"
+       xreflabel="Things to watch out for">
 <title>Things to watch out for</title>
 
 <para>Some odd things that can occur during annotation:</para>
@@ -1084,7 +1085,7 @@ rare.</para>
 
 
 
-<sect2>
+<sect2 id="cg-manual.annopts.accuracy" xreflabel="Accuracy">
 <title>Accuracy</title>
 
 <para>Valgrind's cache profiling has a number of
@@ -1221,7 +1222,8 @@ fail these checks.</para>
 </sect1>
 
 
-<sect1>
+<sect1 id="cg-manual.acting-on"
+       xreflabel="Acting on Cachegrind's information">
 <title>Acting on Cachegrind's information</title>
 <para>
 So, you've managed to profile your program with Cachegrind.  Now what?
@@ -1260,14 +1262,16 @@ yourself.  But at least you have the information!
 
 </sect1>
 
-<sect1>
+<sect1 id="cg-manual.impl-details"
+       xreflabel="Implementation details">
 <title>Implementation details</title>
 <para>
 This section talks about details you don't need to know about in order to
 use Cachegrind, but may be of interest to some people.
 </para>
 
-<sect2>
+<sect2 id="cg-manual.impl-details.how-cg-works"
+       xreflabel="How Cachegrind works">
 <title>How Cachegrind works</title>
 <para>The best reference for understanding how Cachegrind works is chapter 3 of
 "Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote.  It
@@ -1275,7 +1279,8 @@ is available on the <ulink url="&vg-pubs;">Valgrind publications
 page</ulink>.</para>
 </sect2>
 
-<sect2>
+<sect2 id="cg-manual.impl-details.file-format"
+       xreflabel="Cachegrind output file format">
 <title>Cachegrind output file format</title>
 <para>The file format is fairly straightforward, basically giving the
 cost centre for every line, grouped by files and
diff --git a/docs/xml/Makefile.am b/docs/xml/Makefile.am
index 7958b25887..791287221b 100644
--- a/docs/xml/Makefile.am
+++ b/docs/xml/Makefile.am
@@ -7,5 +7,6 @@ EXTRA_DIST =  \
 	manual-writing-tools.xml\
 	quick-start-guide.xml	\
 	tech-docs.xml 		\
+	new-tech-docs.xml 	\
 	vg-entities.xml 	\
 	xml_help.txt
diff --git a/docs/xml/manual-core.xml b/docs/xml/manual-core.xml
index addb2bdd20..a7f5d22f8a 100644
--- a/docs/xml/manual-core.xml
+++ b/docs/xml/manual-core.xml
@@ -119,7 +119,7 @@ benefits of higher optimisation levels whilst keeping relatively small the
 chances of false positives or false negatives from Memcheck.  Also, you
 should compile your code with <computeroutput>-Wall</computeroutput> because
 it can identify some or all of the problems that Valgrind can miss at the
-higher optimisations levels.  (Using <computeroutput>-Wall</computeroutput>
+higher optimisation levels.  (Using <computeroutput>-Wall</computeroutput>
 is also a good idea in general.)  All other tools (as far as we know) are
 unaffected by optimisation level.</para>
 
@@ -657,6 +657,25 @@ categories.</para>
     </listitem>
   </varlistentry>
 
+  <varlistentry id="opt.child-silent-after-fork"
+                xreflabel="--child-silent-after-fork">
+    <term>
+      <option><![CDATA[--child-silent-after-fork=<yes|no> [default: no] ]]></option>
+    </term>
+    <listitem>
+      <para>When enabled, Valgrind will not show any debugging or
+      logging output for the child process resulting from
+      a <varname>fork</varname> call.  This can make the output less
+      confusing (although more misleading) when dealing with processes
+      that create children.  It is particularly useful in conjunction
+      with <varname>--trace-children=yes</varname>.  Use of this flag is also
+      strongly recommended if you are requesting XML output
+      (<varname>--xml=yes</varname>), since otherwise the XML from child and
+      parent may become mixed up, which usually makes it useless.
+      </para>
+    </listitem>
+  </varlistentry>
+
   <varlistentry id="opt.track-fds" xreflabel="--track-fds">
     <term>
       <option><![CDATA[--track-fds=<yes|no> [default: no] ]]></option>
@@ -988,6 +1007,10 @@ that can report errors, e.g. Memcheck, but not Cachegrind.</para>
       process to be debugged and each instance of <literal>%f</literal>
       expands to the path to the executable for the process to be
       debugged.</para>
+
+      <para>Since <computeroutput>&lt;command&gt;</computeroutput> is likely
+      to contain spaces, you will need to put this entire flag in
+      quotes to ensure it is correctly handled by the shell.</para>
     </listitem>
   </varlistentry>
 
@@ -1273,254 +1296,6 @@ don't understand
 </sect1>
 
 
-<sect1 id="manual-core.clientreq" 
-       xreflabel="The Client Request mechanism">
-<title>The Client Request mechanism</title>
-
-<para>Valgrind has a trapdoor mechanism via which the client
-program can pass all manner of requests and queries to Valgrind
-and the current tool.  Internally, this is used extensively to
-make malloc, free, etc, work, although you don't see that.</para>
-
-<para>For your convenience, a subset of these so-called client
-requests is provided to allow you to tell Valgrind facts about
-the behaviour of your program, and also to make queries.
-In particular, your program can tell Valgrind about changes in
-memory range permissions that Valgrind would not otherwise know
-about, and so allows clients to get Valgrind to do arbitrary
-custom checks.</para>
-
-<para>Clients need to include a header file to make this work.
-Which header file depends on which client requests you use.  Some
-client requests are handled by the core, and are defined in the
-header file <filename>valgrind/valgrind.h</filename>.  Tool-specific
-header files are named after the tool, e.g.
-<filename>valgrind/memcheck.h</filename>.  All header files can be found
-in the <literal>include/valgrind</literal> directory of wherever Valgrind
-was installed.</para>
-
-<para>The macros in these header files have the magical property
-that they generate code in-line which Valgrind can spot.
-However, the code does nothing when not run on Valgrind, so you
-are not forced to run your program under Valgrind just because you
-use the macros in this file.  Also, you are not required to link your
-program with any extra supporting libraries.</para>
-
-<para>The code added to your binary has negligible performance impact:
-on x86, amd64, ppc32 and ppc64, the overhead is 6 simple integer instructions
-and is probably undetectable except in tight loops.
-However, if you really wish to compile out the client requests, you can
-compile with <computeroutput>-DNVALGRIND</computeroutput> (analogous to
-<computeroutput>-DNDEBUG</computeroutput>'s effect on
-<computeroutput>assert()</computeroutput>).
-</para>
-
-<para>You are encouraged to copy the <filename>valgrind/*.h</filename> headers
-into your project's include directory, so your program doesn't have a
-compile-time dependency on Valgrind being installed.  The Valgrind headers,
-unlike most of the rest of the code, are under a BSD-style license so you may
-include them without worrying about license incompatibility.</para>
-
-<para>Here is a brief description of the macros available in
-<filename>valgrind.h</filename>, which work with more than one
-tool (see the tool-specific documentation for explanations of the
-tool-specific macros).</para>
-
- <variablelist>
-
-  <varlistentry>
-   <term><command><computeroutput>RUNNING_ON_VALGRIND</computeroutput></command>:</term>
-   <listitem>
-    <para>Returns 1 if running on Valgrind, 0 if running on the
-    real CPU.  If you are running Valgrind on itself, returns the
-    number of layers of Valgrind emulation you're running on.
-    </para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_DISCARD_TRANSLATIONS</computeroutput>:</command></term>
-   <listitem>
-    <para>Discards translations of code in the specified address
-    range.  Useful if you are debugging a JIT compiler or some other
-    dynamic code generation system.  After this call, attempts to
-    execute code in the invalidated address range will cause
-    Valgrind to make new translations of that code, which is
-    probably the semantics you want.  Note that code invalidations
-    are expensive because finding all the relevant translations
-    quickly is very difficult.  So try not to call it often.
-    Note that you can be clever about
-    this: you only need to call it when an area which previously
-    contained code is overwritten with new code.  You can choose
-    to write code into fresh memory, and just call this
-    occasionally to discard large chunks of old code all at
-    once.</para>
-    <para>
-    Alternatively, for transparent self-modifying-code support,
-    use<computeroutput>--smc-check=all</computeroutput>, or run
-    on ppc32/Linux or ppc64/Linux.
-    </para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_COUNT_ERRORS</computeroutput>:</command></term>
-   <listitem>
-    <para>Returns the number of errors found so far by Valgrind.  Can be
-    useful in test harness code when combined with the
-    <option>--log-fd=-1</option> option; this runs Valgrind silently,
-    but the client program can detect when errors occur.  Only useful
-    for tools that report errors, e.g. it's useful for Memcheck, but for
-    Cachegrind it will always return zero because Cachegrind doesn't
-    report errors.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_MALLOCLIKE_BLOCK</computeroutput>:</command></term>
-   <listitem>
-    <para>If your program manages its own memory instead of using
-    the standard <computeroutput>malloc()</computeroutput> /
-    <computeroutput>new</computeroutput> /
-    <computeroutput>new[]</computeroutput>, tools that track
-    information about heap blocks will not do nearly as good a
-    job.  For example, Memcheck won't detect nearly as many
-    errors, and the error messages won't be as informative.  To
-    improve this situation, use this macro just after your custom
-    allocator allocates some new memory.  See the comments in
-    <filename>valgrind.h</filename> for information on how to use
-    it.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_FREELIKE_BLOCK</computeroutput>:</command></term>
-   <listitem>
-    <para>This should be used in conjunction with
-    <computeroutput>VALGRIND_MALLOCLIKE_BLOCK</computeroutput>.
-    Again, see <filename>memcheck/memcheck.h</filename> for
-    information on how to use it.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_CREATE_MEMPOOL</computeroutput>:</command></term>
-   <listitem>
-    <para>This is similar to
-    <computeroutput>VALGRIND_MALLOCLIKE_BLOCK</computeroutput>,
-    but is tailored towards code that uses memory pools.  See the
-    comments in <filename>valgrind.h</filename> for information
-    on how to use it.</para>
-   </listitem>
-  </varlistentry>
-  
-  <varlistentry>
-  <term><command><computeroutput>VALGRIND_DESTROY_MEMPOOL</computeroutput>:</command></term>
-   <listitem>
-    <para>This should be used in conjunction with
-    <computeroutput>VALGRIND_CREATE_MEMPOOL</computeroutput>.
-    Again, see the comments in <filename>valgrind.h</filename> for
-    information on how to use it.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_MEMPOOL_ALLOC</computeroutput>:</command></term>
-   <listitem>
-    <para>This should be used in conjunction with
-    <computeroutput>VALGRIND_CREATE_MEMPOOL</computeroutput>.
-    Again, see the comments in <filename>valgrind.h</filename> for
-    information on how to use it.</para>
-   </listitem>
-  </varlistentry>
-   
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_MEMPOOL_FREE</computeroutput>:</command></term>
-   <listitem>
-    <para>This should be used in conjunction with
-    <computeroutput>VALGRIND_CREATE_MEMPOOL</computeroutput>.
-    Again, see the comments in <filename>valgrind.h</filename> for
-    information on how to use it.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_NON_SIMD_CALL[0123]</computeroutput>:</command></term>
-   <listitem>
-    <para>Executes a function of 0, 1, 2 or 3 args in the client
-    program on the <emphasis>real</emphasis> CPU, not the virtual
-    CPU that Valgrind normally runs code on.  These are used in
-    various ways internally to Valgrind.  They might be useful to
-    client programs.</para> 
-
-    <para><command>Warning:</command> Only use these if you
-    <emphasis>really</emphasis> know what you are doing.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_PRINTF(format, ...)</computeroutput>:</command></term>
-   <listitem>
-    <para>printf a message to the log file when running under
-    Valgrind.  Nothing is output if not running under Valgrind.
-    Returns the number of characters output.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_PRINTF_BACKTRACE(format, ...)</computeroutput>:</command></term>
-   <listitem>
-    <para>printf a message to the log file along with a stack
-    backtrace when running under Valgrind.  Nothing is output if
-    not running under Valgrind.  Returns the number of characters
-    output.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_STACK_REGISTER(start, end)</computeroutput>:</command></term>
-   <listitem>
-    <para>Registers a new stack.  Informs Valgrind that the memory range
-    between start and end is a unique stack.  Returns a stack identifier
-    that can be used with other
-    <computeroutput>VALGRIND_STACK_*</computeroutput> calls.</para>
-    <para>Valgrind will use this information to determine if a change to
-    the stack pointer is an item pushed onto the stack or a change over
-    to a new stack.  Use this if you're using a user-level thread package
-    and are noticing spurious errors from Valgrind about uninitialized
-    memory reads.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_STACK_DEREGISTER(id)</computeroutput>:</command></term>
-   <listitem>
-    <para>Deregisters a previously registered stack.  Informs
-    Valgrind that previously registered memory range with stack id
-    <computeroutput>id</computeroutput> is no longer a stack.</para>
-   </listitem>
-  </varlistentry>
-
-  <varlistentry>
-   <term><command><computeroutput>VALGRIND_STACK_CHANGE(id, start, end)</computeroutput>:</command></term>
-   <listitem>
-    <para>Changes a previously registered stack.  Informs
-    Valgrind that the previously registered stack with stack id
-    <computeroutput>id</computeroutput> has changed its start and end
-    values.  Use this if your user-level thread package implements
-    stack growth.</para>
-   </listitem>
-  </varlistentry>
-
- </variablelist>
-
-<para>Note that <filename>valgrind.h</filename> is included by
-all the tool-specific header files (such as
-<filename>memcheck.h</filename>), so you don't need to include it
-in your client if you include a tool-specific header.</para>
-
-</sect1>
-
 
 
 <sect1 id="manual-core.pthreads" xreflabel="Support for Threads">
@@ -1528,7 +1303,7 @@ in your client if you include a tool-specific header.</para>
 
 <para>Valgrind supports programs which use POSIX pthreads.
 Getting this to work was technically challenging but it now works
-well enough for significant threaded applications to work.</para>
+well enough for significant threaded applications to run.</para>
 
 <para>The main thing to point out is that although Valgrind works
 with the standard Linux threads library (eg. NPTL or LinuxThreads), it
@@ -1544,7 +1319,8 @@ every 100000 basic blocks (on x86, typically around 600000
 instructions), which means you'll get a much finer interleaving
 of thread executions than when run natively.  This in itself may
 cause your program to behave differently if you have some kind of
-concurrency, critical race, locking, or similar, bugs.</para>
+concurrency, critical race, locking, or similar, bugs.  In that case
+you might consider using Valgrind's Helgrind tool to track them down.</para>
 
 <para>Your program will use the native
 <computeroutput>libpthread</computeroutput>, but not all of its facilities
@@ -1595,1203 +1371,1085 @@ will create a core dump in the usual way.</para>
 
 
 
-<sect1 id="manual-core.wrapping" xreflabel="Function Wrapping">
-<title>Function wrapping</title>
 
-<para>
-Valgrind versions 3.2.0 and above can do function wrapping on all
-supported targets.  In function wrapping, calls to some specified
-function are intercepted and rerouted to a different, user-supplied
-function.  This can do whatever it likes, typically examining the
-arguments, calling onwards to the original, and possibly examining the
-result.  Any number of functions may be wrapped.</para>
 
-<para>
-Function wrapping is useful for instrumenting an API in some way.  For
-example, wrapping functions in the POSIX pthreads API makes it
-possible to notify Valgrind of thread status changes, and wrapping
-functions in the MPI (message-passing) API allows notifying Valgrind
-of memory status changes associated with message arrival/departure.
-Such information is usually passed to Valgrind by using client
-requests in the wrapper functions, although that is not of relevance
-here.</para>
 
-<sect2 id="manual-core.wrapping.example" xreflabel="A Simple Example">
-<title>A Simple Example</title>
 
-<para>Supposing we want to wrap some function</para>
 
-<programlisting><![CDATA[
-int foo ( int x, int y ) { return x + y; }]]></programlisting>
+<sect1 id="manual-core.install" xreflabel="Building and Installing">
+<title>Building and Installing Valgrind</title>
 
-<para>A wrapper is a function of identical type, but with a special name
-which identifies it as the wrapper for <computeroutput>foo</computeroutput>.
-Wrappers need to include
-supporting macros from <computeroutput>valgrind.h</computeroutput>.
-Here is a simple wrapper which prints the arguments and return value:</para>
+<para>We use the standard Unix
+<computeroutput>./configure</computeroutput>,
+<computeroutput>make</computeroutput>, <computeroutput>make
+install</computeroutput> mechanism, and we have attempted to
+ensure that it works on machines with kernel 2.4 or 2.6 and glibc
+2.2.X to 2.5.X.  Once you have completed 
+<computeroutput>make install</computeroutput> you may then want 
+to run the regression tests
+with <computeroutput>make regtest</computeroutput>.
+</para>
 
-<programlisting><![CDATA[
-#include <stdio.h>
-#include "valgrind.h"
-int I_WRAP_SONAME_FNNAME_ZU(NONE,foo)( int x, int y )
-{
-   int    result;
-   OrigFn fn;
-   VALGRIND_GET_ORIG_FN(fn);
-   printf("foo's wrapper: args %d %d\n", x, y);
-   CALL_FN_W_WW(result, fn, x,y);
-   printf("foo's wrapper: result %d\n", result);
-   return result;
-}
-]]></programlisting>
+<para>There are five options (in addition to the usual
+<option>--prefix=</option>) which affect how Valgrind is built:
+<itemizedlist>
 
-<para>To become active, the wrapper merely needs to be present in a text
-section somewhere in the same process' address space as the function
-it wraps, and for its ELF symbol name to be visible to Valgrind.  In
-practice, this means either compiling to a 
-<computeroutput>.o</computeroutput> and linking it in, or
-compiling to a <computeroutput>.so</computeroutput> and 
-<computeroutput>LD_PRELOAD</computeroutput>ing it in.  The latter is more
-convenient in that it doesn't require relinking.</para>
+  <listitem>
+    <para><option>--enable-inner</option></para>
+    <para>This builds Valgrind with some special magic hacks which make
+     it possible to run it on a standard build of Valgrind (what the
+     developers call "self-hosting").  Ordinarily you should not use
+     this flag as various kinds of safety checks are disabled.
+   </para>
+  </listitem>
 
-<para>All wrappers have approximately the above form.  There are three
-crucial macros:</para>
+  <listitem>
+    <para><option>--enable-tls</option></para>
+    <para>TLS (Thread Local Storage) is a relatively new mechanism which
+    requires compiler, linker and kernel support.  Valgrind tries to
+    automatically test if TLS is supported and if so enables this option.
+    Sometimes it cannot test for TLS, so this option allows you to
+    override the automatic test.</para>
+  </listitem>
 
-<para><computeroutput>I_WRAP_SONAME_FNNAME_ZU</computeroutput>: 
-this generates the real name of the wrapper.
-This is an encoded name which Valgrind notices when reading symbol
-table information.  What it says is: I am the wrapper for any function
-named <computeroutput>foo</computeroutput> which is found in 
-an ELF shared object with an empty
-("<computeroutput>NONE</computeroutput>") soname field.  The specification 
-mechanism is powerful in
-that wildcards are allowed for both sonames and function names.  
-The details are discussed below.</para>
+  <listitem>
+    <para><option>--with-vex=</option></para>
+    <para>Specifies the path to the underlying VEX dynamic-translation
+     library.  By default this is taken to be in the VEX directory off
+     the root of the source tree.
+   </para>
+  </listitem>
 
-<para><computeroutput>VALGRIND_GET_ORIG_FN</computeroutput>: 
-once in the the wrapper, the first priority is
-to get hold of the address of the original (and any other supporting
-information needed).  This is stored in a value of opaque 
-type <computeroutput>OrigFn</computeroutput>.
-The information is acquired using 
-<computeroutput>VALGRIND_GET_ORIG_FN</computeroutput>.  It is crucial
-to make this macro call before calling any other wrapped function
-in the same thread.</para>
+  <listitem>
+    <para><option>--enable-only64bit</option></para>
+    <para><option>--enable-only32bit</option></para>
+    <para>On 64-bit
+     platforms (amd64-linux, ppc64-linux), Valgrind is by default built
+     in such a way that both 32-bit and 64-bit executables can be run.
+     Sometimes this cleverness is a problem for a variety of reasons.
+     These two flags allow for single-target builds in this situation.
+     If you issue both, the configure script will complain.  Note they
+     are ignored on 32-bit-only platforms (x86-linux, ppc32-linux).
+   </para>
+  </listitem>
 
-<para><computeroutput>CALL_FN_W_WW</computeroutput>: eventually we will
-want to call the function being
-wrapped.  Calling it directly does not work, since that just gets us
-back to the wrapper and tends to kill the program in short order by
-stack overflow.  Instead, the result lvalue, 
-<computeroutput>OrigFn</computeroutput> and arguments are
-handed to one of a family of macros of the form 
-<computeroutput>CALL_FN_*</computeroutput>.  These
-cause Valgrind to call the original and avoid recursion back to the
-wrapper.</para>
-</sect2>
+</itemizedlist>
+</para>
 
-<sect2 id="manual-core.wrapping.specs" xreflabel="Wrapping Specifications">
-<title>Wrapping Specifications</title>
+<para>The <computeroutput>configure</computeroutput> script tests
+the version of the X server indicated by the current
+<computeroutput>$DISPLAY</computeroutput>.  This is a known bug.
+The intention was to detect the version of the current X
+client libraries, so that correct suppressions could be selected
+for them, but instead the test checks the server version.  This
+is just plain wrong.</para>
 
-<para>This scheme has the advantage of being self-contained.  A library of
-wrappers can be compiled to object code in the normal way, and does
-not rely on an external script telling Valgrind which wrappers pertain
-to which originals.</para>
+<para>If you are building a binary package of Valgrind for
+distribution, please read <literal>README_PACKAGERS</literal>
+<xref linkend="dist.readme-packagers"/>.  It contains some
+important information.</para>
 
-<para>Each wrapper has a name which, in the most general case says: I am the
-wrapper for any function whose name matches FNPATT and whose ELF
-"soname" matches SOPATT.  Both FNPATT and SOPATT may contain wildcards
-(asterisks) and other characters (spaces, dots, @, etc) which are not 
-generally regarded as valid C identifier names.</para> 
+<para>Apart from that, there's not much excitement here.  Let us
+know if you have build problems.</para>
 
-<para>This flexibility is needed to write robust wrappers for POSIX pthread
-functions, where typically we are not completely sure of either the
-function name or the soname, or alternatively we want to wrap a whole
-set of functions at once.</para> 
+</sect1>
 
-<para>For example, <computeroutput>pthread_create</computeroutput> 
-in GNU libpthread is usually a
-versioned symbol - one whose name ends in, eg, 
-<computeroutput>@GLIBC_2.3</computeroutput>.  Hence we
-are not sure what its real name is.  We also want to cover any soname
-of the form <computeroutput>libpthread.so*</computeroutput>.
-So the header of the wrapper will be</para>
 
-<programlisting><![CDATA[
-int I_WRAP_SONAME_FNNAME_ZZ(libpthreadZdsoZd0,pthreadZucreateZAZa)
-  ( ... formals ... )
-  { ... body ... }
-]]></programlisting>
 
-<para>In order to write unusual characters as valid C function names, a
-Z-encoding scheme is used.  Names are written literally, except that
-a capital Z acts as an escape character, with the following encoding:</para>
+<sect1 id="manual-core.problems" xreflabel="If You Have Problems">
+<title>If You Have Problems</title>
 
-<programlisting><![CDATA[
-     Za   encodes    *
-     Zp              +
-     Zc              :
-     Zd              .
-     Zu              _
-     Zh              -
-     Zs              (space)
-     ZA              @
-     ZZ              Z
-     ZL              (       # only in valgrind 3.3.0 and later
-     ZR              )       # only in valgrind 3.3.0 and later
-]]></programlisting>
+<para>Contact us at <ulink url="&vg-url;">&vg-url;</ulink>.</para>
 
-<para>Hence <computeroutput>libpthreadZdsoZd0</computeroutput> is an 
-encoding of the soname <computeroutput>libpthread.so.0</computeroutput>
-and <computeroutput>pthreadZucreateZAZa</computeroutput> is an encoding 
-of the function name <computeroutput>pthread_create@*</computeroutput>.
-</para>
+<para>See <xref linkend="manual-core.limits"/> for the known
+limitations of Valgrind, and for a list of programs which are
+known not to work on it.</para>
 
-<para>The macro <computeroutput>I_WRAP_SONAME_FNNAME_ZZ</computeroutput> 
-constructs a wrapper name in which
-both the soname (first component) and function name (second component)
-are Z-encoded.  Encoding the function name can be tiresome and is
-often unnecessary, so a second macro,
-<computeroutput>I_WRAP_SONAME_FNNAME_ZU</computeroutput>, can be
-used instead.  The <computeroutput>_ZU</computeroutput> variant is 
-also useful for writing wrappers for
-C++ functions, in which the function name is usually already mangled
-using some other convention in which Z plays an important role.  Having
-to encode a second time quickly becomes confusing.</para>
+<para>All parts of the system make heavy use of assertions and 
+internal self-checks.  They are permanently enabled, and we have no 
+plans to disable them.  If one of them breaks, please mail us!</para>
 
-<para>Since the function name field may contain wildcards, it can be
-anything, including just <computeroutput>*</computeroutput>.
-The same is true for the soname.
-However, some ELF objects - specifically, main executables - do not
-have sonames.  Any object lacking a soname is treated as if its soname
-was <computeroutput>NONE</computeroutput>, which is why the original 
-example above had a name
-<computeroutput>I_WRAP_SONAME_FNNAME_ZU(NONE,foo)</computeroutput>.</para>
+<para>If you get an assertion failure
+in <filename>m_mallocfree.c</filename>, this may have happened because
+your program wrote off the end of a malloc'd block, or before its
+beginning.  Valgrind hopefully will have emitted a proper message to that
+effect before dying in this way.  This is a known problem which
+we should fix.</para>
 
-<para>Note that the soname of an ELF object is not the same as its
-file name, although it is often similar.  You can find the soname of
-an object <computeroutput>libfoo.so</computeroutput> using the command
-<computeroutput>readelf -a libfoo.so | grep soname</computeroutput>.</para>
-</sect2>
+<para>Read the <xref linkend="FAQ"/> for more advice about common problems, 
+crashes, etc.</para>
 
-<sect2 id="manual-core.wrapping.semantics" xreflabel="Wrapping Semantics">
-<title>Wrapping Semantics</title>
+</sect1>
 
-<para>The ability for a wrapper to replace an infinite family of functions
-is powerful but brings complications in situations where ELF objects
-appear and disappear (are dlopen'd and dlclose'd) on the fly.
-Valgrind tries to maintain sensible behaviour in such situations.</para>
 
-<para>For example, suppose a process has dlopened (an ELF object with
-soname) <computeroutput>object1.so</computeroutput>, which contains 
-<computeroutput>function1</computeroutput>.  It starts to use
-<computeroutput>function1</computeroutput> immediately.</para>
 
-<para>After a while it dlopens <computeroutput>wrappers.so</computeroutput>,
-which contains a wrapper
-for <computeroutput>function1</computeroutput> in (soname) 
-<computeroutput>object1.so</computeroutput>.  All subsequent calls to 
-<computeroutput>function1</computeroutput> are rerouted to the wrapper.</para>
+<sect1 id="manual-core.limits" xreflabel="Limitations">
+<title>Limitations</title>
 
-<para>If <computeroutput>wrappers.so</computeroutput> is 
-later dlclose'd, calls to <computeroutput>function1</computeroutput> are 
-naturally routed back to the original.</para>
+<para>The following list of limitations seems long.  However, most
+programs actually work fine.</para>
 
-<para>Alternatively, if <computeroutput>object1.so</computeroutput>
-is dlclose'd but wrappers.so remains,
-then the wrapper exported by <computeroutput>wrapper.so</computeroutput>
-becomes inactive, since there
-is no way to get to it - there is no original to call any more.  However,
-Valgrind remembers that the wrapper is still present.  If 
-<computeroutput>object1.so</computeroutput> is
-eventually dlopen'd again, the wrapper will become active again.</para>
+<para>Valgrind will run Linux ELF binaries, on a kernel 2.4.X or 2.6.X
+system, on the x86, amd64, ppc32 and ppc64 architectures, subject to the
+following constraints:</para>
 
-<para>In short, valgrind inspects all code loading/unloading events to
-ensure that the set of currently active wrappers remains consistent.</para>
+ <itemizedlist>
+  <listitem>
+   <para>On x86 and amd64, there is no support for 3DNow! instructions.
+   If the translator encounters these, Valgrind will generate a SIGILL
+   when the instruction is executed.  Apart from that, on x86 and amd64,
+   essentially all instructions are supported, up to and including SSE3.
+   </para>
 
-<para>A second possible problem is that of conflicting wrappers.  It is 
-easily possible to load two or more wrappers, both of which claim
-to be wrappers for some third function.  In such cases Valgrind will
-complain about conflicting wrappers when the second one appears, and
-will honour only the first one.</para>
-</sect2>
+   <para>On ppc32 and ppc64, almost all integer, floating point and Altivec
+   instructions are supported.  Specifically: integer and FP insns that are
+   mandatory for PowerPC, the "General-purpose optional" group (fsqrt, fsqrts,
+   stfiwx), the "Graphics optional" group (fre, fres, frsqrte, frsqrtes), and
+   the Altivec (also known as VMX) SIMD instruction set, are supported.</para>
+  </listitem>
 
-<sect2 id="manual-core.wrapping.debugging" xreflabel="Debugging">
-<title>Debugging</title>
+  <listitem>
+   <para>Atomic instruction sequences are not properly supported, in the
+   sense that their atomicity is not preserved.  This will affect any
+   use of synchronization via memory shared between processes.  They
+   will appear to work, but fail sporadically.</para>
+  </listitem>
 
-<para>Figuring out what's going on given the dynamic nature of wrapping
-can be difficult.  The 
-<computeroutput>--trace-redir=yes</computeroutput> flag makes 
-this possible
-by showing the complete state of the redirection subsystem after
-every
-<computeroutput>mmap</computeroutput>/<computeroutput>munmap</computeroutput>
-event affecting code (text).</para>
+  <listitem>
+   <para>If your program does its own memory management, rather than
+   using malloc/new/free/delete, it should still work, but Memcheck's
+   error checking won't be so effective.  If you describe your program's
+   memory management scheme using "client requests" 
+   (see <xref linkend="manual-core.clientreq"/>), Memcheck can do
+   better.  Nevertheless, using malloc/new and free/delete is still the
+   best approach.</para>
+  </listitem>
 
-<para>There are two central concepts:</para>
+  <listitem>
+   <para>Valgrind's signal simulation is not as robust as it could be.
+   Basic POSIX-compliant sigaction and sigprocmask functionality is
+   supplied, but it's conceivable that things could go badly awry if you
+   do weird things with signals.  Workaround: don't.  Programs that do
+   non-POSIX signal tricks are in any case inherently unportable, so
+   should be avoided if possible.</para>
+  </listitem>
 
-<itemizedlist>
+  <listitem>
+   <para>Machine instructions, and system calls, have been implemented
+   on demand.  So it's possible, although unlikely, that a program will
+   fall over with a message to that effect.  If this happens, please
+   report all the details printed out, so we can try and implement the
+   missing feature.</para>
+  </listitem>
 
-  <listitem><para>A "redirection specification" is a binding of 
-  a (soname pattern, fnname pattern) pair to a code address.
-  These bindings are created by writing functions with names
-  made with the 
-  <computeroutput>I_WRAP_SONAME_FNNAME_{ZZ,_ZU}</computeroutput>
-  macros.</para></listitem>
+  <listitem>
+   <para>Memory consumption of your program is greatly increased whilst
+   running under Valgrind.  This is due to the large amount of
+   administrative information maintained behind the scenes.  Another
+   cause is that Valgrind dynamically translates the original
+   executable.  Translated, instrumented code is 12-18 times larger than
+   the original so you can easily end up with 50+ MB of translations
+   when running (eg) a web browser.</para>
+  </listitem>
 
-  <listitem><para>An "active redirection" is code-address to 
-  code-address binding currently in effect.</para></listitem>
+  <listitem>
+   <para>Valgrind can handle dynamically-generated code just fine.  If
+   you regenerate code over the top of old code (ie. at the same memory
+   addresses) and that code is on the stack, Valgrind will realise the
+   code has changed, and work correctly.  This is necessary to handle
+   the trampolines GCC uses to implement nested functions.  If you
+   regenerate code somewhere other than the stack, you will need to use
+   the <option>--smc-check=all</option> flag, and Valgrind will run more
+   slowly than normal.</para>
+  </listitem>
 
-</itemizedlist>
+  <listitem>
+   <para>As of version 3.0.0, Valgrind has the following limitations
+   in its implementation of x86/AMD64 floating point relative to 
+   IEEE754.</para>
 
-<para>The state of the wrapping-and-redirection subsystem comprises a set of
-specifications and a set of active bindings.  The specifications are
-acquired/discarded by watching all 
-<computeroutput>mmap</computeroutput>/<computeroutput>munmap</computeroutput>
-events on code (text)
-sections.  The active binding set is (conceptually) recomputed from
-the specifications, and all known symbol names, following any change
-to the specification set.</para>
+   <para>Precision: There is no support for 80 bit arithmetic.
+   Internally, Valgrind represents all such "long double" numbers in 64
+   bits, and so there may be some differences in results.  Whether or
+   not this is critical remains to be seen.  Note, the x86/amd64
+   fldt/fstpt instructions (read/write 80-bit numbers) are correctly
+   simulated, using conversions to/from 64 bits, so that in-memory
+   images of 80-bit numbers look correct if anyone wants to see.</para>
 
-<para><computeroutput>--trace-redir=yes</computeroutput> shows the contents 
-of both sets following any such event.</para>
+   <para>The impression observed from many FP regression tests is that
+   the accuracy differences aren't significant.  Generally speaking, if
+   a program relies on 80-bit precision, there may be difficulties
+   porting it to non x86/amd64 platforms which only support 64-bit FP
+   precision.  Even on x86/amd64, the program may get different results
+   depending on whether it is compiled to use SSE2 instructions (64-bits
+   only), or x87 instructions (80-bit).  The net effect is to make FP
+   programs behave as if they had been run on a machine with 64-bit IEEE
+   floats, for example PowerPC.  On amd64 FP arithmetic is done by
+   default on SSE2, so amd64 looks more like PowerPC than x86 from an FP
+   perspective, and there are far fewer noticeable accuracy differences
+   than with x86.</para>
 
-<para><computeroutput>-v</computeroutput> prints a line of text each 
-time an active specification is used for the first time.</para>
+   <para>Rounding: Valgrind does observe the 4 IEEE-mandated rounding
+   modes (to nearest, to +infinity, to -infinity, to zero) for the
+   following conversions: float to integer, integer to float where
+   there is a possibility of loss of precision, and float-to-float
+   rounding.  For all other FP operations, only the IEEE default mode
+   (round to nearest) is supported.</para>
 
-<para>Hence for maximum debugging effectiveness you will need to use both
-flags.</para>
+   <para>Numeric exceptions in FP code: IEEE754 defines five types of
+   numeric exception that can happen: invalid operation (sqrt of
+   negative number, etc), division by zero, overflow, underflow,
+   inexact (loss of precision).</para>
 
-<para>One final comment.  The function-wrapping facility is closely
-tied to Valgrind's ability to replace (redirect) specified
-functions, for example to redirect calls to 
-<computeroutput>malloc</computeroutput> to its
-own implementation.  Indeed, a replacement function can be
-regarded as a wrapper function which does not call the original.
-However, to make the implementation more robust, the two kinds
-of interception (wrapping vs replacement) are treated differently.
-</para>
+   <para>For each exception, two courses of action are defined by IEEE754:
+   either (1) a user-defined exception handler may be called, or (2) a
+   default action is defined, which "fixes things up" and allows the
+   computation to proceed without throwing an exception.</para>
 
-<para><computeroutput>--trace-redir=yes</computeroutput> shows 
-specifications and bindings for both
-replacement and wrapper functions.  To differentiate the 
-two, replacement bindings are printed using 
-<computeroutput>R-></computeroutput> whereas 
-wraps are printed using <computeroutput>W-></computeroutput>.
-</para>
-</sect2>
+   <para>Currently Valgrind only supports the default fixup actions.
+   Again, feedback on the importance of exception support would be
+   appreciated.</para>
 
+   <para>When Valgrind detects that the program is trying to exceed any
+   of these limitations (setting exception handlers, rounding mode, or
+   precision control), it can print a message giving a traceback of
+   where this has happened, and continue execution.  This behaviour used
+   to be the default, but the messages are annoying and so showing them
+   is now disabled by default.  Use <option>--show-emwarns=yes</option> to see
+   them.</para>
 
-<sect2 id="manual-core.wrapping.limitations-cf" 
-       xreflabel="Limitations - control flow">
-<title>Limitations - control flow</title>
+   <para>The above limitations define precisely the IEEE754 'default'
+   behaviour: default fixup on all exceptions, round-to-nearest
+   operations, and 64-bit precision.</para>
+  </listitem>
+   
+  <listitem>
+   <para>As of version 3.0.0, Valgrind has the following limitations in
+   its implementation of x86/AMD64 SSE2 FP arithmetic, relative to 
+   IEEE754.</para>
 
-<para>For the most part, the function wrapping implementation is robust.
-The only important caveat is: in a wrapper, get hold of
-the <computeroutput>OrigFn</computeroutput> information using 
-<computeroutput>VALGRIND_GET_ORIG_FN</computeroutput> before calling any
-other wrapped function.  Once you have the 
-<computeroutput>OrigFn</computeroutput>, arbitrary
-calls between, recursion between, and longjumps out of wrappers
-should work correctly.  There is never any interaction between wrapped
-functions and merely replaced functions 
-(eg <computeroutput>malloc</computeroutput>), so you can call
-<computeroutput>malloc</computeroutput> etc safely from within wrappers.
-</para>
+   <para>Essentially the same: no exceptions, and limited observance of
+   rounding mode.  Also, SSE2 has control bits which make it treat
+   denormalised numbers as zero (DAZ) and a related action, flush
+   denormals to zero (FTZ).  Both of these cause SSE2 arithmetic to be
+   less accurate than IEEE requires.  Valgrind detects, ignores, and can
+   warn about, attempts to enable either mode.</para>
+  </listitem>
 
-<para>The above comments are true for {x86,amd64,ppc32}-linux.  On
-ppc64-linux function wrapping is more fragile due to the (arguably
-poorly designed) ppc64-linux ABI.  This mandates the use of a shadow
-stack which tracks entries/exits of both wrapper and replacement
-functions.  This gives two limitations: firstly, longjumping out of
-wrappers will rapidly lead to disaster, since the shadow stack will
-not get correctly cleared.  Secondly, since the shadow stack has
-finite size, recursion between wrapper/replacement functions is only
-possible to a limited depth, beyond which Valgrind has to abort the
-run.  This depth is currently 16 calls.</para>
+  <listitem>
+   <para>As of version 3.2.0, Valgrind has the following limitations
+   in its implementation of PPC32 and PPC64 floating point 
+   arithmetic, relative to IEEE754.</para>
 
-<para>For all platforms ({x86,amd64,ppc32,ppc64}-linux) all the above
-comments apply on a per-thread basis.  In other words, wrapping is
-thread-safe: each thread must individually observe the above
-restrictions, but there is no need for any kind of inter-thread
-cooperation.</para>
-</sect2>
+   <para>Scalar (non-Altivec): Valgrind provides a bit-exact emulation of
+   all floating point instructions, except for "fre" and "fres", which are
+   done more precisely than required by the PowerPC architecture specification.
+   All floating point operations observe the current rounding mode.
+   </para>
 
+   <para>However, fpscr[FPRF] is not set after each operation.  That could
+   be done but would give measurable performance overheads, and so far
+   no need for it has been found.</para>
 
-<sect2 id="manual-core.wrapping.limitations-sigs" 
-       xreflabel="Limitations - original function signatures">
-<title>Limitations - original function signatures</title>
+   <para>As on x86/AMD64, IEEE754 exceptions are not supported: all floating
+   point exceptions are handled using the default IEEE fixup actions.
+   Valgrind detects, ignores, and can warn about, attempts to unmask 
+   the 5 IEEE FP exception kinds by writing to the floating-point status 
+   and control register (fpscr).
+   </para>
 
-<para>As shown in the above example, to call the original you must use a
-macro of the form <computeroutput>CALL_FN_*</computeroutput>.  
-For technical reasons it is impossible
-to create a single macro to deal with all argument types and numbers,
-so a family of macros covering the most common cases is supplied.  In
-what follows, 'W' denotes a machine-word-typed value (a pointer or a
-C <computeroutput>long</computeroutput>), 
-and 'v' denotes C's <computeroutput>void</computeroutput> type.
-The currently available macros are:</para>
-
-<programlisting><![CDATA[
-CALL_FN_v_v       -- call an original of type  void fn ( void )
-CALL_FN_W_v       -- call an original of type  long fn ( void )
-
-CALL_FN_v_W       -- void fn ( long )
-CALL_FN_W_W       -- long fn ( long )
-
-CALL_FN_v_WW      -- void fn ( long, long )
-CALL_FN_W_WW      -- long fn ( long, long )
+   <para>Vector (Altivec, VMX): essentially as with x86/AMD64 SSE/SSE2: 
+   no exceptions, and limited observance of rounding mode.  
+   For Altivec, FP arithmetic
+   is done in IEEE/Java mode, which is more accurate than the Linux default
+   setting.  "More accurate" means that denormals are handled properly, 
+   rather than simply being flushed to zero.</para>
+  </listitem>
+ </itemizedlist>
 
-CALL_FN_v_WWW     -- void fn ( long, long, long )
-CALL_FN_W_WWW     -- long fn ( long, long, long )
+ <para>Programs which are known not to work are:</para>
+ <itemizedlist>
+  <listitem>
+   <para>emacs starts up but immediately concludes it is out of
+   memory and aborts.  It may be that Memcheck does not provide
+   a good enough emulation of the 
+   <computeroutput>mallinfo</computeroutput> function.
+   Emacs works fine if you build it to use
+   the standard malloc/free routines.</para>
+  </listitem>
+ </itemizedlist>
 
-CALL_FN_W_WWWW    -- long fn ( long, long, long, long )
-CALL_FN_W_5W      -- long fn ( long, long, long, long, long )
-CALL_FN_W_6W      -- long fn ( long, long, long, long, long, long )
-and so on, up to 
-CALL_FN_W_12W
-]]></programlisting>
+</sect1>
 
-<para>The set of supported types can be expanded as needed.  It is
-regrettable that this limitation exists.  Function wrapping has proven
-difficult to implement, with a certain apparently unavoidable level of
-ickyness.  After several implementation attempts, the present
-arrangement appears to be the least-worst tradeoff.  At least it works
-reliably in the presence of dynamic linking and dynamic code
-loading/unloading.</para>
 
-<para>You should not attempt to wrap a function of one type signature with a
-wrapper of a different type signature.  Such trickery will surely lead
-to crashes or strange behaviour.  This is not of course a limitation
-of the function wrapping implementation, merely a reflection of the
-fact that it gives you sweeping powers to shoot yourself in the foot
-if you are not careful.  Imagine the instant havoc you could wreak by
-writing a wrapper which matched any function name in any soname - in
-effect, one which claimed to be a wrapper for all functions in the
-process.</para>
-</sect2>
+<sect1 id="manual-core.example" xreflabel="An Example Run">
+<title>An Example Run</title>
 
-<sect2 id="manual-core.wrapping.examples" xreflabel="Examples">
-<title>Examples</title>
+<para>This is the log for a run of a small program using Memcheck.
+The program is in fact correct, and the reported error is the
+result of a potentially serious code generation bug in GNU g++
+(snapshot 20010527).</para>
 
-<para>In the source tree, 
-<computeroutput>memcheck/tests/wrap[1-8].c</computeroutput> provide a series of
-examples, ranging from very simple to quite advanced.</para>
+<programlisting><![CDATA[
+sewardj@phoenix:~/newmat10$ ~/Valgrind-6/valgrind -v ./bogon 
+==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1.
+==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward.
+==25832== Startup, with flags:
+==25832== --suppressions=/home/sewardj/Valgrind/redhat71.supp
+==25832== reading syms from /lib/ld-linux.so.2
+==25832== reading syms from /lib/libc.so.6
+==25832== reading syms from /mnt/pima/jrs/Inst/lib/libgcc_s.so.0
+==25832== reading syms from /lib/libm.so.6
+==25832== reading syms from /mnt/pima/jrs/Inst/lib/libstdc++.so.3
+==25832== reading syms from /home/sewardj/Valgrind/valgrind.so
+==25832== reading syms from /proc/self/exe
+==25832== 
+==25832== Invalid read of size 4
+==25832==    at 0x8048724: BandMatrix::ReSize(int,int,int) (bogon.cpp:45)
+==25832==    by 0x80487AF: main (bogon.cpp:66)
+==25832==  Address 0xBFFFF74C is not stack'd, malloc'd or free'd
+==25832==
+==25832== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
+==25832== malloc/free: in use at exit: 0 bytes in 0 blocks.
+==25832== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
+==25832== For a detailed leak analysis, rerun with: --leak-check=yes
+]]></programlisting>
 
-<para><computeroutput>auxprogs/libmpiwrap.c</computeroutput> is an example 
-of wrapping a big, complex API (the MPI-2 interface).  This file defines 
-almost 300 different wrappers.</para>
-</sect2>
+<para>The GCC folks fixed this about a week before gcc-3.0
+shipped.</para>
 
 </sect1>
 
 
+<sect1 id="manual-core.warnings" xreflabel="Warning Messages">
+<title>Warning Messages You Might See</title>
 
-<sect1 id="manual-core.install" xreflabel="Building and Installing">
-<title>Building and Installing Valgrind</title>
-
-<para>We use the standard Unix
-<computeroutput>./configure</computeroutput>,
-<computeroutput>make</computeroutput>, <computeroutput>make
-install</computeroutput> mechanism, and we have attempted to
-ensure that it works on machines with kernel 2.4 or 2.6 and glibc
-2.2.X to 2.5.X.  Once you have completed 
-<computeroutput>make install</computeroutput> you may then want 
-to run the regression tests
-with <computeroutput>make regtest</computeroutput>.
-</para>
+<para>Most of these only appear if you run in verbose mode
+(enabled by <computeroutput>-v</computeroutput>):</para>
 
-<para>There are five options (in addition to the usual
-<option>--prefix=</option> which affect how Valgrind is built:
-<itemizedlist>
+ <itemizedlist>
 
   <listitem>
-    <para><option>--enable-inner</option></para>
-    <para>This builds Valgrind with some special magic hacks which make
-     it possible to run it on a standard build of Valgrind (what the
-     developers call "self-hosting").  Ordinarily you should not use
-     this flag as various kinds of safety checks are disabled.
-   </para>
+    <para><computeroutput>More than 100 errors detected.  Subsequent
+    errors will still be recorded, but in less detail than
+    before.</computeroutput></para>
+
+    <para>After 100 different errors have been shown, Valgrind becomes
+    more conservative about collecting them.  It then requires only the
+    program counters in the top two stack frames to match when deciding
+    whether or not two errors are really the same one.  Prior to this
+    point, the PCs in the top four frames are required to match.  This
+    hack has the effect of slowing down the appearance of new errors
+    after the first 100.  The 100 constant can be changed by recompiling
+    Valgrind.</para>
   </listitem>
 
   <listitem>
-    <para><option>--enable-tls</option></para>
-    <para>TLS (Thread Local Storage) is a relatively new mechanism which
-    requires compiler, linker and kernel support.  Valgrind tries to
-    automatically test if TLS is supported and if so enables this option.
-    Sometimes it cannot test for TLS, so this option allows you to
-    override the automatic test.</para>
+    <para><computeroutput>More than 1000 errors detected.  I'm not
+    reporting any more.  Final error counts may be inaccurate.  Go fix
+    your program!</computeroutput></para>
+
+    <para>After 1000 different errors have been detected, Valgrind
+    ignores any more.  It seems unlikely that collecting even more
+    different ones would be of practical help to anybody, and it avoids
+    the danger that Valgrind spends more and more of its time comparing
+    new errors against an ever-growing collection.  As above, the 1000
+    number is a compile-time constant.</para>
   </listitem>
 
   <listitem>
-    <para><option>--with-vex=</option></para>
-    <para>Specifies the path to the underlying VEX dynamic-translation
-     library.  By default this is taken to be in the VEX directory off
-     the root of the source tree.
-   </para>
+    <para><computeroutput>Warning: client switching stacks?</computeroutput></para>
+
+    <para>Valgrind spotted such a large change in the stack pointer
+    that it guesses the client is switching to
+    a different stack.  At this point it makes a kludgey guess where the
+    base of the new stack is, and sets memory permissions accordingly.
+    You may get many bogus error messages following this, if Valgrind
+    guesses wrong.  At the moment "large change" is defined as a change
+    of more than 2000000 in the value of the
+    stack pointer register.</para>
   </listitem>
 
   <listitem>
-    <para><option>--enable-only64bit</option></para>
-    <para><option>--enable-only32bit</option></para>
-    <para>On 64-bit
-     platforms (amd64-linux, ppc64-linux), Valgrind is by default built
-     in such a way that both 32-bit and 64-bit executables can be run.
-     Sometimes this cleverness is a problem for a variety of reasons.
-     These two flags allow for single-target builds in this situation.
-     If you issue both, the configure script will complain.  Note they
-     are ignored on 32-bit-only platforms (x86-linux, ppc32-linux).
-   </para>
+    <para><computeroutput>Warning: client attempted to close Valgrind's
+    logfile fd &lt;number&gt;</computeroutput></para>
+
+    <para>Valgrind doesn't allow the client to close the logfile,
+    because you'd never see any diagnostic information after that point.
+    If you see this message, you may want to use the
+    <option>--log-fd=&lt;number&gt;</option> option to specify a
+    different logfile file-descriptor number.</para>
   </listitem>
 
-</itemizedlist>
-</para>
+  <listitem>
+    <para><computeroutput>Warning: noted but unhandled ioctl
+    &lt;number&gt;</computeroutput></para>
 
-<para>The <computeroutput>configure</computeroutput> script tests
-the version of the X server currently indicated by the current
-<computeroutput>$DISPLAY</computeroutput>.  This is a known bug.
-The intention was to detect the version of the current X
-client libraries, so that correct suppressions could be selected
-for them, but instead the test checks the server version.  This
-is just plain wrong.</para>
+    <para>Valgrind observed a call to one of the vast family of
+    <computeroutput>ioctl</computeroutput> system calls, but did not
+    modify its memory status info (because nobody has yet written a 
+    suitable wrapper).  The call will still have gone through, but you may get
+    spurious errors after this as a result of the non-update of the
+    memory info.</para>
+  </listitem>
 
-<para>If you are building a binary package of Valgrind for
-distribution, please read <literal>README_PACKAGERS</literal>
-<xref linkend="dist.readme-packagers"/>.  It contains some
-important information.</para>
+  <listitem>
+    <para><computeroutput>Warning: set address range perms: large range
+    &lt;number&gt;</computeroutput></para>
 
-<para>Apart from that, there's not much excitement here.  Let us
-know if you have build problems.</para>
+    <para>Diagnostic message, mostly for the benefit of the Valgrind
+    developers, to do with memory permissions.</para>
+  </listitem>
+
+ </itemizedlist>
 
 </sect1>
 
 
 
-<sect1 id="manual-core.problems" xreflabel="If You Have Problems">
-<title>If You Have Problems</title>
+<sect1 id="manual-core.clientreq" 
+       xreflabel="The Client Request mechanism">
+<title>The Client Request mechanism</title>
 
-<para>Contact us at <ulink url="&vg-url;">&vg-url;</ulink>.</para>
+<para>Valgrind has a trapdoor mechanism via which the client
+program can pass all manner of requests and queries to Valgrind
+and the current tool.  Internally, this is used extensively to
+make malloc, free, etc, work, although you don't see that.</para>
 
-<para>See <xref linkend="manual-core.limits"/> for the known
-limitations of Valgrind, and for a list of programs which are
-known not to work on it.</para>
+<para>For your convenience, a subset of these so-called client
+requests is provided to allow you to tell Valgrind facts about
+the behaviour of your program, and also to make queries.
+In particular, your program can tell Valgrind about changes in
+memory range permissions that Valgrind would not otherwise know
+about, and so can get Valgrind to do arbitrary
+custom checks.</para>
 
-<para>All parts of the system make heavy use of assertions and 
-internal self-checks.  They are permanently enabled, and we have no 
-plans to disable them.  If one of them breaks, please mail us!</para>
+<para>Clients need to include a header file to make this work.
+Which header file depends on which client requests you use.  Some
+client requests are handled by the core, and are defined in the
+header file <filename>valgrind/valgrind.h</filename>.  Tool-specific
+header files are named after the tool, e.g.
+<filename>valgrind/memcheck.h</filename>.  All header files can be found
+in the <literal>include/valgrind</literal> directory of wherever Valgrind
+was installed.</para>
 
-<para>If you get an assertion failure
-in <filename>m_mallocfree.c</filename>, this may have happened because
-your program wrote off the end of a malloc'd block, or before its
-beginning.  Valgrind hopefully will have emitted a proper message to that
-effect before dying in this way.  This is a known problem which
-we should fix.</para>
-
-<para>Read the <xref linkend="FAQ"/> for more advice about common problems, 
-crashes, etc.</para>
-
-</sect1>
-
-
-
-<sect1 id="manual-core.limits" xreflabel="Limitations">
-<title>Limitations</title>
-
-<para>The following list of limitations seems long.  However, most
-programs actually work fine.</para>
-
-<para>Valgrind will run Linux ELF binaries, on a kernel 2.4.X or 2.6.X
-system, on the x86, amd64, ppc32 and ppc64 architectures, subject to the
-following constraints:</para>
-
- <itemizedlist>
-  <listitem>
-   <para>On x86 and amd64, there is no support for 3DNow! instructions.
-   If the translator encounters these, Valgrind will generate a SIGILL
-   when the instruction is executed.  Apart from that, on x86 and amd64,
-   essentially all instructions are supported, up to and including SSE3.
-   </para>
-
-   <para>On ppc32 and ppc64, almost all integer, floating point and Altivec
-   instructions are supported.  Specifically: integer and FP insns that are
-   mandatory for PowerPC, the "General-purpose optional" group (fsqrt, fsqrts,
-   stfiwx), the "Graphics optional" group (fre, fres, frsqrte, frsqrtes), and
-   the Altivec (also known as VMX) SIMD instruction set, are supported.</para>
-  </listitem>
-
-  <listitem>
-   <para>Atomic instruction sequences are not properly supported, in the
-   sense that their atomicity is not preserved.  This will affect any
-   use of synchronization via memory shared between processes.  They
-   will appear to work, but fail sporadically.</para>
-  </listitem>
-
-  <listitem>
-   <para>If your program does its own memory management, rather than
-   using malloc/new/free/delete, it should still work, but Valgrind's
-   error checking won't be so effective.  If you describe your program's
-   memory management scheme using "client requests" 
-   (see <xref linkend="manual-core.clientreq"/>), Memcheck can do
-   better.  Nevertheless, using malloc/new and free/delete is still the
-   best approach.</para>
-  </listitem>
-
-  <listitem>
-   <para>Valgrind's signal simulation is not as robust as it could be.
-   Basic POSIX-compliant sigaction and sigprocmask functionality is
-   supplied, but it's conceivable that things could go badly awry if you
-   do weird things with signals.  Workaround: don't.  Programs that do
-   non-POSIX signal tricks are in any case inherently unportable, so
-   should be avoided if possible.</para>
-  </listitem>
-
-  <listitem>
-   <para>Machine instructions, and system calls, have been implemented
-   on demand.  So it's possible, although unlikely, that a program will
-   fall over with a message to that effect.  If this happens, please
-   report all the details printed out, so we can try and implement the
-   missing feature.</para>
-  </listitem>
+<para>The macros in these header files have the magical property
+that they generate code in-line which Valgrind can spot.
+However, the code does nothing when not run on Valgrind, so you
+are not forced to run your program under Valgrind just because you
+use the macros in these files.  Also, you are not required to link your
+program with any extra supporting libraries.</para>
 
-  <listitem>
-   <para>Memory consumption of your program is majorly increased whilst
-   running under Valgrind.  This is due to the large amount of
-   administrative information maintained behind the scenes.  Another
-   cause is that Valgrind dynamically translates the original
-   executable.  Translated, instrumented code is 12-18 times larger than
-   the original so you can easily end up with 50+ MB of translations
-   when running (eg) a web browser.</para>
-  </listitem>
+<para>The code added to your binary has negligible performance impact:
+on x86, amd64, ppc32 and ppc64, the overhead is 6 simple integer instructions
+and is probably undetectable except in tight loops.
+However, if you really wish to compile out the client requests, you can
+compile with <computeroutput>-DNVALGRIND</computeroutput> (analogous to
+<computeroutput>-DNDEBUG</computeroutput>'s effect on
+<computeroutput>assert()</computeroutput>).
+</para>
 
-  <listitem>
-   <para>Valgrind can handle dynamically-generated code just fine.  If
-   you regenerate code over the top of old code (ie. at the same memory
-   addresses), if the code is on the stack Valgrind will realise the
-   code has changed, and work correctly.  This is necessary to handle
-   the trampolines GCC uses to implemented nested functions.  If you
-   regenerate code somewhere other than the stack, you will need to use
-   the <option>--smc-check=all</option> flag, and Valgrind will run more
-   slowly than normal.</para>
-  </listitem>
+<para>You are encouraged to copy the <filename>valgrind/*.h</filename> headers
+into your project's include directory, so your program doesn't have a
+compile-time dependency on Valgrind being installed.  The Valgrind headers,
+unlike most of the rest of the code, are under a BSD-style license so you may
+include them without worrying about license incompatibility.</para>
 
-  <listitem>
-   <para>As of version 3.0.0, Valgrind has the following limitations
-   in its implementation of x86/AMD64 floating point relative to 
-   IEEE754.</para>
+<para>Here is a brief description of the macros available in
+<filename>valgrind.h</filename>, which work with more than one
+tool (see the tool-specific documentation for explanations of the
+tool-specific macros).</para>
 
-   <para>Precision: There is no support for 80 bit arithmetic.
-   Internally, Valgrind represents all such "long double" numbers in 64
-   bits, and so there may be some differences in results.  Whether or
-   not this is critical remains to be seen.  Note, the x86/amd64
-   fldt/fstpt instructions (read/write 80-bit numbers) are correctly
-   simulated, using conversions to/from 64 bits, so that in-memory
-   images of 80-bit numbers look correct if anyone wants to see.</para>
+ <variablelist>
 
-   <para>The impression observed from many FP regression tests is that
-   the accuracy differences aren't significant.  Generally speaking, if
-   a program relies on 80-bit precision, there may be difficulties
-   porting it to non x86/amd64 platforms which only support 64-bit FP
-   precision.  Even on x86/amd64, the program may get different results
-   depending on whether it is compiled to use SSE2 instructions (64-bits
-   only), or x87 instructions (80-bit).  The net effect is to make FP
-   programs behave as if they had been run on a machine with 64-bit IEEE
-   floats, for example PowerPC.  On amd64 FP arithmetic is done by
-   default on SSE2, so amd64 looks more like PowerPC than x86 from an FP
-   perspective, and there are far fewer noticeable accuracy differences
-   than with x86.</para>
+  <varlistentry>
+   <term><command><computeroutput>RUNNING_ON_VALGRIND</computeroutput>:</command></term>
+   <listitem>
+    <para>Returns 1 if running on Valgrind, 0 if running on the
+    real CPU.  If you are running Valgrind on itself, returns the
+    number of layers of Valgrind emulation you're running on.
+    </para>
+   </listitem>
+  </varlistentry>
 
-   <para>Rounding: Valgrind does observe the 4 IEEE-mandated rounding
-   modes (to nearest, to +infinity, to -infinity, to zero) for the
-   following conversions: float to integer, integer to float where
-   there is a possibility of loss of precision, and float-to-float
-   rounding.  For all other FP operations, only the IEEE default mode
-   (round to nearest) is supported.</para>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_DISCARD_TRANSLATIONS</computeroutput>:</command></term>
+   <listitem>
+    <para>Discards translations of code in the specified address
+    range.  Useful if you are debugging a JIT compiler or some other
+    dynamic code generation system.  After this call, attempts to
+    execute code in the invalidated address range will cause
+    Valgrind to make new translations of that code, which is
+    probably the semantics you want.  Note that code invalidations
+    are expensive because finding all the relevant translations
+    quickly is very difficult, so try not to call it often.
+    You can be clever about
+    this: you only need to call it when an area which previously
+    contained code is overwritten with new code.  You can choose
+    to write code into fresh memory, and just call this
+    occasionally to discard large chunks of old code all at
+    once.</para>
+    <para>
+    Alternatively, for transparent self-modifying-code support,
+    use <computeroutput>--smc-check=all</computeroutput>, or run
+    on ppc32/Linux or ppc64/Linux.
+    </para>
+   </listitem>
+  </varlistentry>
 
-   <para>Numeric exceptions in FP code: IEEE754 defines five types of
-   numeric exception that can happen: invalid operation (sqrt of
-   negative number, etc), division by zero, overflow, underflow,
-   inexact (loss of precision).</para>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_COUNT_ERRORS</computeroutput>:</command></term>
+   <listitem>
+    <para>Returns the number of errors found so far by Valgrind.  Can be
+    useful in test harness code when combined with the
+    <option>--log-fd=-1</option> option; this runs Valgrind silently,
+    but the client program can detect when errors occur.  Only useful
+    for tools that report errors, e.g. it's useful for Memcheck, but for
+    Cachegrind it will always return zero because Cachegrind doesn't
+    report errors.</para>
+   </listitem>
+  </varlistentry>
 
-   <para>For each exception, two courses of action are defined by IEEE754:
-   either (1) a user-defined exception handler may be called, or (2) a
-   default action is defined, which "fixes things up" and allows the
-   computation to proceed without throwing an exception.</para>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_MALLOCLIKE_BLOCK</computeroutput>:</command></term>
+   <listitem>
+    <para>If your program manages its own memory instead of using
+    the standard <computeroutput>malloc()</computeroutput> /
+    <computeroutput>new</computeroutput> /
+    <computeroutput>new[]</computeroutput>, tools that track
+    information about heap blocks will not do nearly as good a
+    job.  For example, Memcheck won't detect nearly as many
+    errors, and the error messages won't be as informative.  To
+    improve this situation, use this macro just after your custom
+    allocator allocates some new memory.  See the comments in
+    <filename>valgrind.h</filename> for information on how to use
+    it.</para>
+   </listitem>
+  </varlistentry>
 
-   <para>Currently Valgrind only supports the default fixup actions.
-   Again, feedback on the importance of exception support would be
-   appreciated.</para>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_FREELIKE_BLOCK</computeroutput>:</command></term>
+   <listitem>
+    <para>This should be used in conjunction with
+    <computeroutput>VALGRIND_MALLOCLIKE_BLOCK</computeroutput>.
+    Again, see the comments in <filename>valgrind.h</filename> for
+    information on how to use it.</para>
+   </listitem>
+  </varlistentry>
 
-   <para>When Valgrind detects that the program is trying to exceed any
-   of these limitations (setting exception handlers, rounding mode, or
-   precision control), it can print a message giving a traceback of
-   where this has happened, and continue execution.  This behaviour used
-   to be the default, but the messages are annoying and so showing them
-   is now disabled by default.  Use <option>--show-emwarns=yes</option> to see
-   them.</para>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_CREATE_MEMPOOL</computeroutput>:</command></term>
+   <listitem>
+    <para>This is similar to
+    <computeroutput>VALGRIND_MALLOCLIKE_BLOCK</computeroutput>,
+    but is tailored towards code that uses memory pools.  See the
+    comments in <filename>valgrind.h</filename> for information
+    on how to use it.</para>
+   </listitem>
+  </varlistentry>
+  
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_DESTROY_MEMPOOL</computeroutput>:</command></term>
+   <listitem>
+    <para>This should be used in conjunction with
+    <computeroutput>VALGRIND_CREATE_MEMPOOL</computeroutput>.
+    Again, see the comments in <filename>valgrind.h</filename> for
+    information on how to use it.</para>
+   </listitem>
+  </varlistentry>
 
-   <para>The above limitations define precisely the IEEE754 'default'
-   behaviour: default fixup on all exceptions, round-to-nearest
-   operations, and 64-bit precision.</para>
-  </listitem>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_MEMPOOL_ALLOC</computeroutput>:</command></term>
+   <listitem>
+    <para>This should be used in conjunction with
+    <computeroutput>VALGRIND_CREATE_MEMPOOL</computeroutput>.
+    Again, see the comments in <filename>valgrind.h</filename> for
+    information on how to use it.</para>
+   </listitem>
+  </varlistentry>
    
-  <listitem>
-   <para>As of version 3.0.0, Valgrind has the following limitations in
-   its implementation of x86/AMD64 SSE2 FP arithmetic, relative to 
-   IEEE754.</para>
-
-   <para>Essentially the same: no exceptions, and limited observance of
-   rounding mode.  Also, SSE2 has control bits which make it treat
-   denormalised numbers as zero (DAZ) and a related action, flush
-   denormals to zero (FTZ).  Both of these cause SSE2 arithmetic to be
-   less accurate than IEEE requires.  Valgrind detects, ignores, and can
-   warn about, attempts to enable either mode.</para>
-  </listitem>
-
-  <listitem>
-   <para>As of version 3.2.0, Valgrind has the following limitations
-   in its implementation of PPC32 and PPC64 floating point 
-   arithmetic, relative to IEEE754.</para>
-
-   <para>Scalar (non-Altivec): Valgrind provides a bit-exact emulation of
-   all floating point instructions, except for "fre" and "fres", which are
-   done more precisely than required by the PowerPC architecture specification.
-   All floating point operations observe the current rounding mode.
-   </para>
-
-   <para>However, fpscr[FPRF] is not set after each operation.  That could
-   be done but would give measurable performance overheads, and so far
-   no need for it has been found.</para>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_MEMPOOL_FREE</computeroutput>:</command></term>
+   <listitem>
+    <para>This should be used in conjunction with
+    <computeroutput>VALGRIND_CREATE_MEMPOOL</computeroutput>.
+    Again, see the comments in <filename>valgrind.h</filename> for
+    information on how to use it.</para>
+   </listitem>
+  </varlistentry>
 
-   <para>As on x86/AMD64, IEEE754 exceptions are not supported: all floating
-   point exceptions are handled using the default IEEE fixup actions.
-   Valgrind detects, ignores, and can warn about, attempts to unmask 
-   the 5 IEEE FP exception kinds by writing to the floating-point status 
-   and control register (fpscr).
-   </para>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_NON_SIMD_CALL[0123]</computeroutput>:</command></term>
+   <listitem>
+    <para>Executes a function of 0, 1, 2 or 3 args in the client
+    program on the <emphasis>real</emphasis> CPU, not the virtual
+    CPU that Valgrind normally runs code on.  These are used in
+    various ways internally to Valgrind.  They might be useful to
+    client programs.</para> 
 
-   <para>Vector (Altivec, VMX): essentially as with x86/AMD64 SSE/SSE2: 
-   no exceptions, and limited observance of rounding mode.  
-   For Altivec, FP arithmetic
-   is done in IEEE/Java mode, which is more accurate than the Linux default
-   setting.  "More accurate" means that denormals are handled properly, 
-   rather than simply being flushed to zero.</para>
-  </listitem>
- </itemizedlist>
+    <para><command>Warning:</command> Only use these if you
+    <emphasis>really</emphasis> know what you are doing.</para>
+   </listitem>
+  </varlistentry>
 
- <para>Programs which are known not to work are:</para>
- <itemizedlist>
-  <listitem>
-   <para>emacs starts up but immediately concludes it is out of
-   memory and aborts.  It may be that Memcheck does not provide
-   a good enough emulation of the 
-   <computeroutput>mallinfo</computeroutput> function.
-   Emacs works fine if you build it to use
-   the standard malloc/free routines.</para>
-  </listitem>
- </itemizedlist>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_PRINTF(format, ...)</computeroutput>:</command></term>
+   <listitem>
+    <para>Prints a printf-style message to the log file when running under
+    Valgrind.  Nothing is output if not running under Valgrind.
+    Returns the number of characters output.</para>
+   </listitem>
+  </varlistentry>
 
-</sect1>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_PRINTF_BACKTRACE(format, ...)</computeroutput>:</command></term>
+   <listitem>
+    <para>Prints a printf-style message to the log file along with a stack
+    backtrace when running under Valgrind.  Nothing is output if
+    not running under Valgrind.  Returns the number of characters
+    output.</para>
+   </listitem>
+  </varlistentry>
 
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_STACK_REGISTER(start, end)</computeroutput>:</command></term>
+   <listitem>
+    <para>Registers a new stack.  Informs Valgrind that the memory range
+    between start and end is a unique stack.  Returns a stack identifier
+    that can be used with other
+    <computeroutput>VALGRIND_STACK_*</computeroutput> calls.</para>
+    <para>Valgrind will use this information to determine if a change to
+    the stack pointer is an item pushed onto the stack or a change over
+    to a new stack.  Use this if you're using a user-level thread package
+    and are noticing spurious errors from Valgrind about uninitialized
+    memory reads.</para>
+   </listitem>
+  </varlistentry>
 
-<sect1 id="manual-core.example" xreflabel="An Example Run">
-<title>An Example Run</title>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_STACK_DEREGISTER(id)</computeroutput>:</command></term>
+   <listitem>
+    <para>Deregisters a previously registered stack.  Informs
+    Valgrind that the previously registered memory range with stack id
+    <computeroutput>id</computeroutput> is no longer a stack.</para>
+   </listitem>
+  </varlistentry>
 
-<para>This is the log for a run of a small program using Memcheck.
-The program is in fact correct, and the reported error is as the
-result of a potentially serious code generation bug in GNU g++
-(snapshot 20010527).</para>
+  <varlistentry>
+   <term><command><computeroutput>VALGRIND_STACK_CHANGE(id, start, end)</computeroutput>:</command></term>
+   <listitem>
+    <para>Changes a previously registered stack.  Informs
+    Valgrind that the previously registered stack with stack id
+    <computeroutput>id</computeroutput> has changed its start and end
+    values.  Use this if your user-level thread package implements
+    stack growth.</para>
+   </listitem>
+  </varlistentry>
 
-<programlisting><![CDATA[
-sewardj@phoenix:~/newmat10$ ~/Valgrind-6/valgrind -v ./bogon 
-==25832== Valgrind 0.10, a memory error detector for x86 RedHat 7.1.
-==25832== Copyright (C) 2000-2001, and GNU GPL'd, by Julian Seward.
-==25832== Startup, with flags:
-==25832== --suppressions=/home/sewardj/Valgrind/redhat71.supp
-==25832== reading syms from /lib/ld-linux.so.2
-==25832== reading syms from /lib/libc.so.6
-==25832== reading syms from /mnt/pima/jrs/Inst/lib/libgcc_s.so.0
-==25832== reading syms from /lib/libm.so.6
-==25832== reading syms from /mnt/pima/jrs/Inst/lib/libstdc++.so.3
-==25832== reading syms from /home/sewardj/Valgrind/valgrind.so
-==25832== reading syms from /proc/self/exe
-==25832== 
-==25832== Invalid read of size 4
-==25832==    at 0x8048724: _ZN10BandMatrix6ReSizeEiii (bogon.cpp:45)
-==25832==    by 0x80487AF: main (bogon.cpp:66)
-==25832==  Address 0xBFFFF74C is not stack'd, malloc'd or free'd
-==25832==
-==25832== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
-==25832== malloc/free: in use at exit: 0 bytes in 0 blocks.
-==25832== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
-==25832== For a detailed leak analysis, rerun with: --leak-check=yes
-==25832==
-==25832== exiting, did 1881 basic blocks, 0 misses.
-==25832== 223 translations, 3626 bytes in, 56801 bytes out.]]></programlisting>
+ </variablelist>
 
-<para>The GCC folks fixed this about a week before gcc-3.0
-shipped.</para>
+<para>Note that <filename>valgrind.h</filename> is included by
+all the tool-specific header files (such as
+<filename>memcheck.h</filename>), so you don't need to include it
+in your client if you include a tool-specific header.</para>
 
 </sect1>
 
 
-<sect1 id="manual-core.warnings" xreflabel="Warning Messages">
-<title>Warning Messages You Might See</title>
 
-<para>Most of these only appear if you run in verbose mode
-(enabled by <computeroutput>-v</computeroutput>):</para>
 
- <itemizedlist>
 
-  <listitem>
-    <para><computeroutput>More than 100 errors detected.  Subsequent
-    errors will still be recorded, but in less detail than
-    before.</computeroutput></para>
+<sect1 id="manual-core.wrapping" xreflabel="Function Wrapping">
+<title>Function Wrapping</title>
 
-    <para>After 100 different errors have been shown, Valgrind becomes
-    more conservative about collecting them.  It then requires only the
-    program counters in the top two stack frames to match when deciding
-    whether or not two errors are really the same one.  Prior to this
-    point, the PCs in the top four frames are required to match.  This
-    hack has the effect of slowing down the appearance of new errors
-    after the first 100.  The 100 constant can be changed by recompiling
-    Valgrind.</para>
-  </listitem>
+<para>
+Valgrind versions 3.2.0 and above can do function wrapping on all
+supported targets.  In function wrapping, calls to some specified
+function are intercepted and rerouted to a different, user-supplied
+function.  The wrapper can do whatever it likes, typically examining the
+arguments, calling onwards to the original, and possibly examining the
+result.  Any number of functions may be wrapped.</para>
 
-  <listitem>
-    <para><computeroutput>More than 1000 errors detected.  I'm not
-    reporting any more.  Final error counts may be inaccurate.  Go fix
-    your program!</computeroutput></para>
+<para>
+Function wrapping is useful for instrumenting an API in some way.  For
+example, wrapping functions in the POSIX pthreads API makes it
+possible to notify Valgrind of thread status changes, and wrapping
+functions in the MPI (message-passing) API allows notifying Valgrind
+of memory status changes associated with message arrival/departure.
+Such information is usually passed to Valgrind by using client
+requests in the wrapper functions, although that is not of relevance
+here.</para>
 
-    <para>After 1000 different errors have been detected, Valgrind
-    ignores any more.  It seems unlikely that collecting even more
-    different ones would be of practical help to anybody, and it avoids
-    the danger that Valgrind spends more and more of its time comparing
-    new errors against an ever-growing collection.  As above, the 1000
-    number is a compile-time constant.</para>
-  </listitem>
+<sect2 id="manual-core.wrapping.example" xreflabel="A Simple Example">
+<title>A Simple Example</title>
 
-  <listitem>
-    <para><computeroutput>Warning: client switching stacks?</computeroutput></para>
+<para>Suppose we want to wrap some function:</para>
 
-    <para>Valgrind spotted such a large change in the stack pointer
-    that it guesses the client is switching to
-    a different stack.  At this point it makes a kludgey guess where the
-    base of the new stack is, and sets memory permissions accordingly.
-    You may get many bogus error messages following this, if Valgrind
-    guesses wrong.  At the moment "large change" is defined as a change
-    of more that 2000000 in the value of the
-    stack pointer register.</para>
-  </listitem>
+<programlisting><![CDATA[
+int foo ( int x, int y ) { return x + y; }]]></programlisting>
 
-  <listitem>
-    <para><computeroutput>Warning: client attempted to close Valgrind's
-    logfile fd &lt;number&gt;</computeroutput></para>
+<para>A wrapper is a function of identical type, but with a special name
+which identifies it as the wrapper for <computeroutput>foo</computeroutput>.
+Wrappers need to include
+supporting macros from <computeroutput>valgrind.h</computeroutput>.
+Here is a simple wrapper which prints the arguments and return value:</para>
 
-    <para>Valgrind doesn't allow the client to close the logfile,
-    because you'd never see any diagnostic information after that point.
-    If you see this message, you may want to use the
-    <option>--log-fd=&lt;number&gt;</option> option to specify a
-    different logfile file-descriptor number.</para>
-  </listitem>
+<programlisting><![CDATA[
+#include <stdio.h>
+#include "valgrind.h"
+int I_WRAP_SONAME_FNNAME_ZU(NONE,foo)( int x, int y )
+{
+   int    result;
+   OrigFn fn;
+   VALGRIND_GET_ORIG_FN(fn);
+   printf("foo's wrapper: args %d %d\n", x, y);
+   CALL_FN_W_WW(result, fn, x,y);
+   printf("foo's wrapper: result %d\n", result);
+   return result;
+}
+]]></programlisting>
 
-  <listitem>
-    <para><computeroutput>Warning: noted but unhandled ioctl
-    &lt;number&gt;</computeroutput></para>
+<para>To become active, the wrapper merely needs to be present in a text
+section somewhere in the same process' address space as the function
+it wraps, and for its ELF symbol name to be visible to Valgrind.  In
+practice, this means either compiling to a 
+<computeroutput>.o</computeroutput> and linking it in, or
+compiling to a <computeroutput>.so</computeroutput> and 
+<computeroutput>LD_PRELOAD</computeroutput>ing it in.  The latter is more
+convenient in that it doesn't require relinking.</para>
 
-    <para>Valgrind observed a call to one of the vast family of
-    <computeroutput>ioctl</computeroutput> system calls, but did not
-    modify its memory status info (because nobody has yet written a 
-    suitable wrapper).  The call will still have gone through, but you may get
-    spurious errors after this as a result of the non-update of the
-    memory info.</para>
-  </listitem>
+<para>All wrappers have approximately the above form.  There are three
+crucial macros:</para>
 
-  <listitem>
-    <para><computeroutput>Warning: set address range perms: large range
-    &lt;number></computeroutput></para>
+<para><computeroutput>I_WRAP_SONAME_FNNAME_ZU</computeroutput>: 
+this generates the real name of the wrapper.
+This is an encoded name which Valgrind notices when reading symbol
+table information.  What it says is: I am the wrapper for any function
+named <computeroutput>foo</computeroutput> which is found in 
+an ELF shared object with an empty
+("<computeroutput>NONE</computeroutput>") soname field.  The specification 
+mechanism is powerful in
+that wildcards are allowed for both sonames and function names.  
+The details are discussed below.</para>
 
-    <para>Diagnostic message, mostly for benefit of the Valgrind
-    developers, to do with memory permissions.</para>
-  </listitem>
+<para><computeroutput>VALGRIND_GET_ORIG_FN</computeroutput>: 
+once in the wrapper, the first priority is
+to get hold of the address of the original (and any other supporting
+information needed).  This is stored in a value of opaque 
+type <computeroutput>OrigFn</computeroutput>.
+The information is acquired using 
+<computeroutput>VALGRIND_GET_ORIG_FN</computeroutput>.  It is crucial
+to make this macro call before calling any other wrapped function
+in the same thread.</para>
 
- </itemizedlist>
+<para><computeroutput>CALL_FN_W_WW</computeroutput>: eventually we will
+want to call the function being
+wrapped.  Calling it directly does not work, since that just gets us
+back to the wrapper and tends to kill the program in short order by
+stack overflow.  Instead, the result lvalue, 
+<computeroutput>OrigFn</computeroutput> and arguments are
+handed to one of a family of macros of the form 
+<computeroutput>CALL_FN_*</computeroutput>.  These
+cause Valgrind to call the original and avoid recursion back to the
+wrapper.</para>
+</sect2>
+
+<sect2 id="manual-core.wrapping.specs" xreflabel="Wrapping Specifications">
+<title>Wrapping Specifications</title>
+
+<para>This scheme has the advantage of being self-contained.  A library of
+wrappers can be compiled to object code in the normal way, and does
+not rely on an external script telling Valgrind which wrappers pertain
+to which originals.</para>
 
-</sect1>
+<para>Each wrapper has a name which, in the most general case, says: I am the
+wrapper for any function whose name matches FNPATT and whose ELF
+"soname" matches SOPATT.  Both FNPATT and SOPATT may contain wildcards
+(asterisks) and other characters (spaces, dots, @, etc) which are not 
+generally regarded as valid C identifier names.</para> 
 
+<para>This flexibility is needed to write robust wrappers for POSIX pthread
+functions, where typically we are not completely sure of either the
+function name or the soname, or alternatively we want to wrap a whole
+set of functions at once.</para> 
 
-<sect1 id="manual-core.mpiwrap" xreflabel="MPI Wrappers">
-<title>Debugging MPI Parallel Programs with Valgrind</title>
-
-<para> Valgrind supports debugging of distributed-memory applications
-which use the MPI message passing standard.  This support consists of a
-library of wrapper functions for the
-<computeroutput>PMPI_*</computeroutput> interface.  When incorporated
-into the application's address space, either by direct linking or by
-<computeroutput>LD_PRELOAD</computeroutput>, the wrappers intercept
-calls to <computeroutput>PMPI_Send</computeroutput>,
-<computeroutput>PMPI_Recv</computeroutput>, etc.  They then
-use client requests to inform Valgrind of memory state changes caused
-by the function being wrapped.  This reduces the number of false
-positives that Memcheck otherwise typically reports for MPI
-applications.</para>
-
-<para>The wrappers also take the opportunity to carefully check
-size and definedness of buffers passed as arguments to MPI functions, hence
-detecting errors such as passing undefined data to
-<computeroutput>PMPI_Send</computeroutput>, or receiving data into a
-buffer which is too small.</para>
-
-<para>Unlike most of the rest of Valgrind, the wrapper library is subject to a
-BSD-style license, so you can link it into any code base you like.
-See the top of <computeroutput>auxprogs/libmpiwrap.c</computeroutput>
-for license details.</para>
-
-
-<sect2 id="manual-core.mpiwrap.build" xreflabel="Building MPI Wrappers">
-<title>Building and installing the wrappers</title>
-
-<para> The wrapper library will be built automatically if possible.
-Valgrind's configure script will look for a suitable
-<computeroutput>mpicc</computeroutput> to build it with.  This must be
-the same <computeroutput>mpicc</computeroutput> you use to build the
-MPI application you want to debug.  By default, Valgrind tries
-<computeroutput>mpicc</computeroutput>, but you can specify a
-different one by using the configure-time flag
-<computeroutput>--with-mpicc=</computeroutput>.  Currently the
-wrappers are only buildable with
-<computeroutput>mpicc</computeroutput>s which are based on GNU
-<computeroutput>gcc</computeroutput> or Intel's
-<computeroutput>icc</computeroutput>.</para>
-
-<para>Check that the configure script prints a line like this:</para>
+<para>For example, <computeroutput>pthread_create</computeroutput> 
+in GNU libpthread is usually a
+versioned symbol - one whose name ends in, eg, 
+<computeroutput>@GLIBC_2.3</computeroutput>.  Hence we
+are not sure what its real name is.  We also want to cover any soname
+of the form <computeroutput>libpthread.so*</computeroutput>.
+So the header of the wrapper will be</para>
 
 <programlisting><![CDATA[
-checking for usable MPI2-compliant mpicc and mpi.h... yes, mpicc
+int I_WRAP_SONAME_FNNAME_ZZ(libpthreadZdsoZd0,pthreadZucreateZAZa)
+  ( ... formals ... )
+  { ... body ... }
 ]]></programlisting>
 
-<para>If it says <computeroutput>... no</computeroutput>, your
-<computeroutput>mpicc</computeroutput> has failed to compile and link
-a test MPI2 program.</para>
-
-<para>If the configure test succeeds, continue in the usual way with
-<computeroutput>make</computeroutput> and <computeroutput>make
-install</computeroutput>.  The final install tree should then contain
-<computeroutput>libmpiwrap.so</computeroutput>.
-</para>
-
-<para>Compile up a test MPI program (eg, MPI hello-world) and try
-this:</para>
+<para>In order to write unusual characters as valid C function names, a
+Z-encoding scheme is used.  Names are written literally, except that
+a capital Z acts as an escape character, with the following encoding:</para>
 
 <programlisting><![CDATA[
-LD_PRELOAD=$prefix/lib/valgrind/<platform>/libmpiwrap.so   \
-           mpirun [args] $prefix/bin/valgrind ./hello
+     Za   encodes    *
+     Zp              +
+     Zc              :
+     Zd              .
+     Zu              _
+     Zh              -
+     Zs              (space)
+     ZA              @
+     ZZ              Z
+     ZL              (       # only in valgrind 3.3.0 and later
+     ZR              )       # only in valgrind 3.3.0 and later
 ]]></programlisting>
 
-<para>You should see something similar to the following</para>
+<para>Hence <computeroutput>libpthreadZdsoZd0</computeroutput> is an 
+encoding of the soname <computeroutput>libpthread.so.0</computeroutput>
+and <computeroutput>pthreadZucreateZAZa</computeroutput> is an encoding 
+of the function name <computeroutput>pthread_create@*</computeroutput>.
+</para>
 
-<programlisting><![CDATA[
-valgrind MPI wrappers 31901: Active for pid 31901
-valgrind MPI wrappers 31901: Try MPIWRAP_DEBUG=help for possible options
-]]></programlisting>
+<para>The macro <computeroutput>I_WRAP_SONAME_FNNAME_ZZ</computeroutput> 
+constructs a wrapper name in which
+both the soname (first component) and function name (second component)
+are Z-encoded.  Encoding the function name can be tiresome and is
+often unnecessary, so a second macro,
+<computeroutput>I_WRAP_SONAME_FNNAME_ZU</computeroutput>, can be
+used instead.  The <computeroutput>_ZU</computeroutput> variant is 
+also useful for writing wrappers for
+C++ functions, in which the function name is usually already mangled
+using some other convention in which Z plays an important role.  Having
+to encode a second time quickly becomes confusing.</para>
 
-<para>repeated for every process in the group.  If you do not see
-these, there is an build/installation problem of some kind.</para>
+<para>Since the function name field may contain wildcards, it can be
+anything, including just <computeroutput>*</computeroutput>.
+The same is true for the soname.
+However, some ELF objects - specifically, main executables - do not
+have sonames.  Any object lacking a soname is treated as if its soname
+was <computeroutput>NONE</computeroutput>, which is why the original 
+example above had a name
+<computeroutput>I_WRAP_SONAME_FNNAME_ZU(NONE,foo)</computeroutput>.</para>
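+<para>The matching rule described here can be sketched in C as follows.
+This is only a model: <computeroutput>glob_match</computeroutput> and
+<computeroutput>effective_soname</computeroutput> are hypothetical names,
+not Valgrind functions, and the sketch assumes that
+<computeroutput>*</computeroutput> is the only wildcard character.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: an object without a soname is treated as if its
   soname were "NONE". */
static const char *effective_soname(const char *soname)
{
    return soname != NULL ? soname : "NONE";
}

/* Hypothetical model: return true if `str` matches `pat`, where '*' in
   `pat` matches any run of characters, including an empty one. */
static bool glob_match(const char *pat, const char *str)
{
    assert(pat != NULL && str != NULL);
    if (*pat == '\0')
        return *str == '\0';
    if (*pat == '*') {
        /* Try every possible length for the run matched by '*'. */
        for (;; str++) {
            if (glob_match(pat + 1, str)) return true;
            if (*str == '\0') return false;
        }
    }
    /* Ordinary character: must match exactly, then continue. */
    return *str == *pat && glob_match(pat + 1, str + 1);
}
```

+<para>In this model a pattern of just <computeroutput>*</computeroutput>
+matches every name, and the pattern
+<computeroutput>NONE</computeroutput> matches the effective soname of a
+main executable.</para>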
 
-<para> The MPI functions to be wrapped are assumed to be in an ELF
-shared object with soname matching
-<computeroutput>libmpi.so*</computeroutput>.  This is known to be
-correct at least for Open MPI and Quadrics MPI, and can easily be
-changed if required.</para> 
+<para>Note that the soname of an ELF object is not the same as its
+file name, although it is often similar.  You can find the soname of
+an object <computeroutput>libfoo.so</computeroutput> using the command
+<computeroutput>readelf -a libfoo.so | grep soname</computeroutput>.</para>
 </sect2>
 
+<sect2 id="manual-core.wrapping.semantics" xreflabel="Wrapping Semantics">
+<title>Wrapping Semantics</title>
 
-<sect2 id="manual-core.mpiwrap.gettingstarted" 
-       xreflabel="Getting started with MPI Wrappers">
-<title>Getting started</title>
+<para>The ability for a wrapper to replace an infinite family of functions
+is powerful but brings complications in situations where ELF objects
+appear and disappear (are dlopen'd and dlclose'd) on the fly.
+Valgrind tries to maintain sensible behaviour in such situations.</para>
 
-<para>Compile your MPI application as usual, taking care to link it
-using the same <computeroutput>mpicc</computeroutput> that your
-Valgrind build was configured with.</para>
+<para>For example, suppose a process has dlopened (an ELF object with
+soname) <computeroutput>object1.so</computeroutput>, which contains 
+<computeroutput>function1</computeroutput>.  It starts to use
+<computeroutput>function1</computeroutput> immediately.</para>
 
-<para>
-Use the following basic scheme to run your application on Valgrind with
-the wrappers engaged:</para>
+<para>After a while it dlopens <computeroutput>wrappers.so</computeroutput>,
+which contains a wrapper
+for <computeroutput>function1</computeroutput> in (soname) 
+<computeroutput>object1.so</computeroutput>.  All subsequent calls to 
+<computeroutput>function1</computeroutput> are rerouted to the wrapper.</para>
 
-<programlisting><![CDATA[
-MPIWRAP_DEBUG=[wrapper-args]                                  \
-   LD_PRELOAD=$prefix/lib/valgrind/<platform>/libmpiwrap.so   \
-   mpirun [mpirun-args]                                       \
-   $prefix/bin/valgrind [valgrind-args]                       \
-   [application] [app-args]
-]]></programlisting>
+<para>If <computeroutput>wrappers.so</computeroutput> is 
+later dlclose'd, calls to <computeroutput>function1</computeroutput> are 
+naturally routed back to the original.</para>
+
+<para>Alternatively, if <computeroutput>object1.so</computeroutput>
+is dlclose'd but <computeroutput>wrappers.so</computeroutput> remains,
+then the wrapper exported by <computeroutput>wrappers.so</computeroutput>
+becomes inactive, since there
+is no way to get to it - there is no original to call any more.  However,
+Valgrind remembers that the wrapper is still present.  If 
+<computeroutput>object1.so</computeroutput> is
+eventually dlopen'd again, the wrapper will become active again.</para>
+
+<para>In short, Valgrind inspects all code loading/unloading events to
+ensure that the set of currently active wrappers remains consistent.</para>
 
-<para>As an alternative to
-<computeroutput>LD_PRELOAD</computeroutput>ing
-<computeroutput>libmpiwrap.so</computeroutput>, you can simply link it
-to your application if desired.  This should not disturb native
-behaviour of your application in any way.</para>
+<para>A second possible problem is that of conflicting wrappers.  It is
+easy to load two or more wrappers, all of which claim
+to be wrappers for some third function.  In such cases Valgrind will
+complain about conflicting wrappers when the second one appears, and
+will honour only the first one.</para>
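+<para>The behaviour described in this section can be summarised by a small
+C model.  It is a sketch only: the <computeroutput>Spec</computeroutput>
+type and the <computeroutput>recompute_active</computeroutput> function are
+hypothetical and do not correspond to Valgrind's internals.  It assumes
+exact names rather than patterns, and that the array order reflects the
+order in which the wrappers appeared.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one wrapper; not a Valgrind data structure. */
typedef struct {
    const char *provider;   /* soname of the object exporting the wrapper */
    const char *target_so;  /* soname containing the wrapped function     */
    const char *target_fn;  /* name of the wrapped function               */
} Spec;

static bool is_loaded(const char *const *loaded, size_t n_loaded,
                      const char *soname)
{
    for (size_t i = 0; i < n_loaded; i++)
        if (strcmp(loaded[i], soname) == 0) return true;
    return false;
}

/* Recompute the active set after a load/unload event.  active[i] is set
   when both the wrapper's object and its target object are loaded and no
   earlier active wrapper claims the same function.  Returns the number
   of active wrappers. */
static size_t recompute_active(const Spec *specs, size_t n_specs,
                               const char *const *loaded, size_t n_loaded,
                               bool *active)
{
    size_t count = 0;

    assert(specs != NULL && active != NULL);
    for (size_t i = 0; i < n_specs; i++) {
        active[i] = is_loaded(loaded, n_loaded, specs[i].provider)
                 && is_loaded(loaded, n_loaded, specs[i].target_so);
        for (size_t j = 0; active[i] && j < i; j++) {
            if (active[j]
                && strcmp(specs[j].target_so, specs[i].target_so) == 0
                && strcmp(specs[j].target_fn, specs[i].target_fn) == 0)
                active[i] = false;   /* conflict: the earlier wrapper wins */
        }
        if (active[i]) count++;
    }
    return count;
}
```

+<para>In this model, unloading <computeroutput>object1.so</computeroutput>
+deactivates its wrapper without forgetting it, and loading it again
+reactivates the wrapper at the next recomputation.</para>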
 </sect2>
 
+<sect2 id="manual-core.wrapping.debugging" xreflabel="Debugging">
+<title>Debugging</title>
 
-<sect2 id="manual-core.mpiwrap.controlling" 
-       xreflabel="Controlling the MPI Wrappers">
-<title>Controlling the wrapper library</title>
+<para>Figuring out what's going on given the dynamic nature of wrapping
+can be difficult.  The 
+<computeroutput>--trace-redir=yes</computeroutput> flag makes 
+this easier
+by showing the complete state of the redirection subsystem after
+every
+<computeroutput>mmap</computeroutput>/<computeroutput>munmap</computeroutput>
+event affecting code (text).</para>
 
-<para>Environment variable
-<computeroutput>MPIWRAP_DEBUG</computeroutput> is consulted at
-startup.  The default behaviour is to print a starting banner</para>
+<para>There are two central concepts:</para>
 
-<programlisting><![CDATA[
-valgrind MPI wrappers 16386: Active for pid 16386
-valgrind MPI wrappers 16386: Try MPIWRAP_DEBUG=help for possible options
-]]></programlisting>
+<itemizedlist>
 
-<para> and then be relatively quiet.</para>
+  <listitem><para>A "redirection specification" is a binding of 
+  a (soname pattern, fnname pattern) pair to a code address.
+  These bindings are created by writing functions with names
+  made with the 
+  <computeroutput>I_WRAP_SONAME_FNNAME_{ZZ,ZU}</computeroutput>
+  macros.</para></listitem>
 
-<para>You can give a list of comma-separated options in
-<computeroutput>MPIWRAP_DEBUG</computeroutput>.  These are</para>
+  <listitem><para>An "active redirection" is a code-address to 
+  code-address binding currently in effect.</para></listitem>
 
-<itemizedlist>
-  <listitem>
-    <para><computeroutput>verbose</computeroutput>:
-    show entries/exits of all wrappers.  Also show extra
-    debugging info, such as the status of outstanding 
-    <computeroutput>MPI_Request</computeroutput>s resulting
-    from uncompleted <computeroutput>MPI_Irecv</computeroutput>s.</para>
-  </listitem>
-  <listitem>
-    <para><computeroutput>quiet</computeroutput>: 
-    opposite of <computeroutput>verbose</computeroutput>, only print 
-    anything when the wrappers want
-    to report a detected programming error, or in case of catastrophic
-    failure of the wrappers.</para>
-  </listitem>
-  <listitem>
-    <para><computeroutput>warn</computeroutput>: 
-    by default, functions which lack proper wrappers
-    are not commented on, just silently
-    ignored.  This causes a warning to be printed for each unwrapped
-    function used, up to a maximum of three warnings per function.</para>
-  </listitem>
-  <listitem>
-    <para><computeroutput>strict</computeroutput>: 
-    print an error message and abort the program if 
-    a function lacking a wrapper is used.</para>
-  </listitem>
 </itemizedlist>
 
-<para> If you want to use Valgrind's XML output facility
-(<computeroutput>--xml=yes</computeroutput>), you should pass
-<computeroutput>quiet</computeroutput> in
-<computeroutput>MPIWRAP_DEBUG</computeroutput> so as to get rid of any
-extraneous printing from the wrappers.</para>
+<para>The state of the wrapping-and-redirection subsystem comprises a set of
+specifications and a set of active bindings.  The specifications are
+acquired/discarded by watching all 
+<computeroutput>mmap</computeroutput>/<computeroutput>munmap</computeroutput>
+events on code (text)
+sections.  The active binding set is (conceptually) recomputed from
+the specifications, and all known symbol names, following any change
+to the specification set.</para>
 
-</sect2>
+<para><computeroutput>--trace-redir=yes</computeroutput> shows the contents 
+of both sets following any such event.</para>
 
+<para><computeroutput>-v</computeroutput> prints a line of text each 
+time an active redirection is used for the first time.</para>
 
-<sect2 id="manual-core.mpiwrap.limitations" 
-       xreflabel="Abilities and Limitations of MPI Wrappers">
-<title>Abilities and limitations</title>
+<para>Hence for maximum debugging effectiveness you will need to use both
+flags.</para>
 
-<sect3>
-<title>Functions</title>
+<para>One final comment.  The function-wrapping facility is closely
+tied to Valgrind's ability to replace (redirect) specified
+functions, for example to redirect calls to 
+<computeroutput>malloc</computeroutput> to its
+own implementation.  Indeed, a replacement function can be
+regarded as a wrapper function which does not call the original.
+However, to make the implementation more robust, the two kinds
+of interception (wrapping vs replacement) are treated differently.
+</para>
 
-<para>All MPI2 functions except
-<computeroutput>MPI_Wtick</computeroutput>,
-<computeroutput>MPI_Wtime</computeroutput> and
-<computeroutput>MPI_Pcontrol</computeroutput> have wrappers.  The
-first two are not wrapped because they return a 
-<computeroutput>double</computeroutput>, and Valgrind's
-function-wrap mechanism cannot handle that (it could easily enough be
-extended to).  <computeroutput>MPI_Pcontrol</computeroutput> cannot be
-wrapped as it has variable arity: 
-<computeroutput>int MPI_Pcontrol(const int level, ...)</computeroutput></para>
+<para><computeroutput>--trace-redir=yes</computeroutput> shows 
+specifications and bindings for both
+replacement and wrapper functions.  To differentiate the 
+two, replacement bindings are printed using 
+<computeroutput>R-></computeroutput> whereas 
+wraps are printed using <computeroutput>W-></computeroutput>.
+</para>
+</sect2>
 
-<para>Most functions are wrapped with a default wrapper which does
-nothing except complain or abort if it is called, depending on
-settings in <computeroutput>MPIWRAP_DEBUG</computeroutput> listed
-above.  The following functions have "real", do-something-useful
-wrappers:</para>
 
-<programlisting><![CDATA[
-PMPI_Send PMPI_Bsend PMPI_Ssend PMPI_Rsend
+<sect2 id="manual-core.wrapping.limitations-cf" 
+       xreflabel="Limitations - control flow">
+<title>Limitations - control flow</title>
 
-PMPI_Recv PMPI_Get_count
+<para>For the most part, the function wrapping implementation is robust.
+The only important caveat is: in a wrapper, get hold of
+the <computeroutput>OrigFn</computeroutput> information using 
+<computeroutput>VALGRIND_GET_ORIG_FN</computeroutput> before calling any
+other wrapped function.  Once you have the 
+<computeroutput>OrigFn</computeroutput>, arbitrary
+calls between, recursion between, and longjumps out of wrappers
+should work correctly.  There is never any interaction between wrapped
+functions and merely replaced functions 
+(eg <computeroutput>malloc</computeroutput>), so you can call
+<computeroutput>malloc</computeroutput> etc safely from within wrappers.
+</para>
 
-PMPI_Isend PMPI_Ibsend PMPI_Issend PMPI_Irsend
+<para>The above comments are true for {x86,amd64,ppc32}-linux.  On
+ppc64-linux function wrapping is more fragile due to the (arguably
+poorly designed) ppc64-linux ABI.  This mandates the use of a shadow
+stack which tracks entries/exits of both wrapper and replacement
+functions.  This gives two limitations: firstly, longjumping out of
+wrappers will rapidly lead to disaster, since the shadow stack will
+not get correctly cleared.  Secondly, since the shadow stack has
+finite size, recursion between wrapper/replacement functions is only
+possible to a limited depth, beyond which Valgrind has to abort the
+run.  This depth is currently 16 calls.</para>
 
-PMPI_Irecv
-PMPI_Wait PMPI_Waitall
-PMPI_Test PMPI_Testall
+<para>For all platforms ({x86,amd64,ppc32,ppc64}-linux) all the above
+comments apply on a per-thread basis.  In other words, wrapping is
+thread-safe: each thread must individually observe the above
+restrictions, but there is no need for any kind of inter-thread
+cooperation.</para>
+</sect2>
 
-PMPI_Iprobe PMPI_Probe
 
-PMPI_Cancel
+<sect2 id="manual-core.wrapping.limitations-sigs" 
+       xreflabel="Limitations - original function signatures">
+<title>Limitations - original function signatures</title>
 
-PMPI_Sendrecv
+<para>As shown in the above example, to call the original you must use a
+macro of the form <computeroutput>CALL_FN_*</computeroutput>.  
+For technical reasons it is impossible
+to create a single macro to deal with all argument types and numbers,
+so a family of macros covering the most common cases is supplied.  In
+what follows, 'W' denotes a machine-word-typed value (a pointer or a
+C <computeroutput>long</computeroutput>), 
+and 'v' denotes C's <computeroutput>void</computeroutput> type.
+The currently available macros are:</para>
 
-PMPI_Type_commit PMPI_Type_free
+<programlisting><![CDATA[
+CALL_FN_v_v       -- call an original of type  void fn ( void )
+CALL_FN_W_v       -- call an original of type  long fn ( void )
 
-PMPI_Pack PMPI_Unpack
+CALL_FN_v_W       -- void fn ( long )
+CALL_FN_W_W       -- long fn ( long )
 
-PMPI_Bcast PMPI_Gather PMPI_Scatter PMPI_Alltoall
-PMPI_Reduce PMPI_Allreduce PMPI_Op_create
+CALL_FN_v_WW      -- void fn ( long, long )
+CALL_FN_W_WW      -- long fn ( long, long )
 
-PMPI_Comm_create PMPI_Comm_dup PMPI_Comm_free PMPI_Comm_rank PMPI_Comm_size
+CALL_FN_v_WWW     -- void fn ( long, long, long )
+CALL_FN_W_WWW     -- long fn ( long, long, long )
 
-PMPI_Error_string
-PMPI_Init PMPI_Initialized PMPI_Finalize
+CALL_FN_W_WWWW    -- long fn ( long, long, long, long )
+CALL_FN_W_5W      -- long fn ( long, long, long, long, long )
+CALL_FN_W_6W      -- long fn ( long, long, long, long, long, long )
+and so on, up to 
+CALL_FN_W_12W
 ]]></programlisting>
 
-<para> A few functions such as
-<computeroutput>PMPI_Address</computeroutput> are listed as
-<computeroutput>HAS_NO_WRAPPER</computeroutput>.  They have no wrapper
-at all as there is nothing worth checking, and giving a no-op wrapper
-would reduce performance for no reason.</para>
-
-<para> Note that the wrapper library itself can itself generate large
-numbers of calls to the MPI implementation, especially when walking
-complex types.  The most common functions called are
-<computeroutput>PMPI_Extent</computeroutput>,
-<computeroutput>PMPI_Type_get_envelope</computeroutput>,
-<computeroutput>PMPI_Type_get_contents</computeroutput>, and
-<computeroutput>PMPI_Type_free</computeroutput>.  </para>
-</sect3>
-
-<sect3>
-<title>Types</title>
-
-<para> MPI-1.1 structured types are supported, and walked exactly.
-The currently supported combiners are
-<computeroutput>MPI_COMBINER_NAMED</computeroutput>,
-<computeroutput>MPI_COMBINER_CONTIGUOUS</computeroutput>,
-<computeroutput>MPI_COMBINER_VECTOR</computeroutput>,
-<computeroutput>MPI_COMBINER_HVECTOR</computeroutput>
-<computeroutput>MPI_COMBINER_INDEXED</computeroutput>,
-<computeroutput>MPI_COMBINER_HINDEXED</computeroutput> and
-<computeroutput>MPI_COMBINER_STRUCT</computeroutput>.  This should
-cover all MPI-1.1 types.  The mechanism (function
-<computeroutput>walk_type</computeroutput>) should extend easily to
-cover MPI2 combiners.</para>
-
-<para>MPI defines some named structured types
-(<computeroutput>MPI_FLOAT_INT</computeroutput>,
-<computeroutput>MPI_DOUBLE_INT</computeroutput>,
-<computeroutput>MPI_LONG_INT</computeroutput>,
-<computeroutput>MPI_2INT</computeroutput>,
-<computeroutput>MPI_SHORT_INT</computeroutput>,
-<computeroutput>MPI_LONG_DOUBLE_INT</computeroutput>) which are pairs
-of some basic type and a C <computeroutput>int</computeroutput>.
-Unfortunately the MPI specification makes it impossible to look inside
-these types and see where the fields are.  Therefore these wrappers
-assume the types are laid out as <computeroutput>struct { float val;
-int loc; }</computeroutput> (for
-<computeroutput>MPI_FLOAT_INT</computeroutput>), etc, and act
-accordingly.  This appears to be correct at least for Open MPI 1.0.2
-and for Quadrics MPI.</para>
-
-<para>If <computeroutput>strict</computeroutput> is an option specified 
-in <computeroutput>MPIWRAP_DEBUG</computeroutput>, the application
-will abort if an unhandled type is encountered.  Otherwise, the 
-application will print a warning message and continue.</para>
-
-<para>Some effort is made to mark/check memory ranges corresponding to
-arrays of values in a single pass.  This is important for performance
-since asking Valgrind to mark/check any range, no matter how small,
-carries quite a large constant cost.  This optimisation is applied to
-arrays of primitive types (<computeroutput>double</computeroutput>,
-<computeroutput>float</computeroutput>,
-<computeroutput>int</computeroutput>,
-<computeroutput>long</computeroutput>, <computeroutput>long
-long</computeroutput>, <computeroutput>short</computeroutput>,
-<computeroutput>char</computeroutput>, and <computeroutput>long
-double</computeroutput> on platforms where <computeroutput>sizeof(long
-double) == 8</computeroutput>).  For arrays of all other types, the
-wrappers handle each element individually and so there can be a very
-large performance cost.</para>
-
-</sect3>
+<para>The set of supported types can be expanded as needed.  It is
+regrettable that this limitation exists.  Function wrapping has proven
+difficult to implement, with a certain apparently unavoidable level of
+ickyness.  After several implementation attempts, the present
+arrangement appears to be the least-worst tradeoff.  At least it works
+reliably in the presence of dynamic linking and dynamic code
+loading/unloading.</para>
 
+<para>You should not attempt to wrap a function of one type signature with a
+wrapper of a different type signature.  Such trickery will surely lead
+to crashes or strange behaviour.  This is not of course a limitation
+of the function wrapping implementation, merely a reflection of the
+fact that it gives you sweeping powers to shoot yourself in the foot
+if you are not careful.  Imagine the instant havoc you could wreak by
+writing a wrapper which matched any function name in any soname - in
+effect, one which claimed to be a wrapper for all functions in the
+process.</para>
 </sect2>
 
+<sect2 id="manual-core.wrapping.examples" xreflabel="Examples">
+<title>Examples</title>
 
-<sect2 id="manual-core.mpiwrap.writingwrappers" 
-       xreflabel="Writing new MPI Wrappers">
-<title>Writing new wrappers</title>
+<para>In the source tree, 
+<computeroutput>memcheck/tests/wrap[1-8].c</computeroutput> provide a series of
+examples, ranging from very simple to quite advanced.</para>
 
-<para>
-For the most part the wrappers are straightforward.  The only
-significant complexity arises with nonblocking receives.</para>
-
-<para>The issue is that <computeroutput>MPI_Irecv</computeroutput>
-states the recv buffer and returns immediately, giving a handle
-(<computeroutput>MPI_Request</computeroutput>) for the transaction.
-Later the user will have to poll for completion with
-<computeroutput>MPI_Wait</computeroutput> etc, and when the
-transaction completes successfully, the wrappers have to paint the
-recv buffer.  But the recv buffer details are not presented to
-<computeroutput>MPI_Wait</computeroutput> -- only the handle is.  The
-library therefore maintains a shadow table which associates
-uncompleted <computeroutput>MPI_Request</computeroutput>s with the
-corresponding buffer address/count/type.  When an operation completes,
-the table is searched for the associated address/count/type info, and
-memory is marked accordingly.</para>
-
-<para>Access to the table is guarded by a (POSIX pthreads) lock, so as
-to make the library thread-safe.</para>
-
-<para>The table is allocated with
-<computeroutput>malloc</computeroutput> and never
-<computeroutput>free</computeroutput>d, so it will show up in leak
-checks.</para>
-
-<para>Writing new wrappers should be fairly easy.  The source file is
-<computeroutput>auxprogs/libmpiwrap.c</computeroutput>.  If possible,
-find an existing wrapper for a function of similar behaviour to the
-one you want to wrap, and use it as a starting point.  The wrappers
-are organised in sections in the same order as the MPI 1.1 spec, to
-aid navigation.  When adding a wrapper, remember to comment out the
-definition of the default wrapper in the long list of defaults at the
-bottom of the file (do not remove it, just comment it out).</para>
+<para><computeroutput>auxprogs/libmpiwrap.c</computeroutput> is an example 
+of wrapping a big, complex API (the MPI-2 interface).  This file defines 
+almost 300 different wrappers.</para>
 </sect2>
 
-<sect2 id="manual-core.mpiwrap.whattoexpect" 
-       xreflabel="What to expect with MPI Wrappers">
-<title>What to expect when using the wrappers</title>
-
-<para>The wrappers should reduce Memcheck's false-error rate on MPI
-applications.  Because the wrapping is done at the MPI interface,
-there will still potentially be a large number of errors reported in
-the MPI implementation below the interface.  The best you can do is
-try to suppress them.</para>
-
-<para>You may also find that the input-side (buffer
-length/definedness) checks find errors in your MPI use, for example
-passing too short a buffer to
-<computeroutput>MPI_Recv</computeroutput>.</para>
-
-<para>Functions which are not wrapped may increase the false
-error rate.  A possible approach is to run with
-<computeroutput>MPI_DEBUG</computeroutput> containing
-<computeroutput>warn</computeroutput>.  This will show you functions
-which lack proper wrappers but which are nevertheless used.  You can
-then write wrappers for them.
-</para>
+</sect1>
 
-<para>A known source of potential false errors are the
-<computeroutput>PMPI_Reduce</computeroutput> family of functions, when
-using a custom (user-defined) reduction function.  In a reduction
-operation, each node notionally sends data to a "central point" which
-uses the specified reduction function to merge the data items into a
-single item.  Hence, in general, data is passed between nodes and fed
-to the reduction function, but the wrapper library cannot mark the
-transferred data as initialised before it is handed to the reduction
-function, because all that happens "inside" the
-<computeroutput>PMPI_Reduce</computeroutput> call.  As a result you
-may see false positives reported in your reduction function.</para>
 
-</sect2>
 
-</sect1>
 
 </chapter>
diff --git a/docs/xml/manual-intro.xml b/docs/xml/manual-intro.xml
index 7a4152d0eb..a43fae5be8 100644
--- a/docs/xml/manual-intro.xml
+++ b/docs/xml/manual-intro.xml
@@ -11,7 +11,7 @@
 <para>Valgrind is a suite of simulation-based debugging and profiling
 tools for programs running on Linux (x86, amd64, ppc32 and ppc64).
 The system consists of a core, which provides a synthetic CPU in
-software, and a series of tools, each of which performs some kind of
+software, and a set of tools, each of which performs some kind of
 debugging, profiling, or similar task.  The architecture is modular,
 so that new tools can be created easily and without disturbing the
 existing structure.</para>
@@ -106,6 +106,30 @@ summary, these are:</para>
      paging needed.</para>
    </listitem>
 
+   <listitem>
+     <para><command>Helgrind</command> detects synchronisation errors
+     in programs that use the POSIX pthreads threading primitives.  It
+     detects the following three classes of errors:</para>
+
+     <itemizedlist>
+      <listitem>
+        <para>Misuses of the POSIX pthreads API.</para>
+      </listitem>
+      <listitem>
+        <para>Potential deadlocks arising from lock ordering
+        problems.</para>
+      </listitem>
+      <listitem>
+       <para>Data races -- accessing memory without adequate locking.</para>
+      </listitem>
+    </itemizedlist>
+
+    <para>Problems like these often result in unreproducible,
+    timing-dependent crashes, deadlocks and other misbehaviour, and
+    can be difficult to find by other means.</para>
+
+   </listitem>
+
 </orderedlist>
   
 
@@ -119,19 +143,22 @@ integer and floating point operations your program does.</para>
 
 <para>Valgrind is closely tied to details of the CPU and operating
 system, and to a lesser extent, the compiler and basic C libraries.
-Nonetheless, as of version 3.2.0 it supports several platforms:
+Nonetheless, as of version 3.3.0 it supports several platforms:
 x86/Linux (mature), amd64/Linux (maturing), ppc32/Linux and
-ppc64/Linux (less mature but work well).  Valgrind uses the standard Unix
+ppc64/Linux (less mature but work well).  There is also experimental
+support for ppc32/AIX5 and ppc64/AIX5 (AIX 5.2 and 5.3 only).
+Valgrind uses the standard Unix
 <computeroutput>./configure</computeroutput>,
 <computeroutput>make</computeroutput>, <computeroutput>make
 install</computeroutput> mechanism, and we have attempted to ensure that
 it works on machines with Linux kernel 2.4.X or 2.6.X and glibc
-2.2.X to 2.5.X.</para>
+2.2.X to 2.7.X.</para>
 
 <para>Valgrind is licensed under the <xref linkend="license.gpl"/>,
 version 2.  The <computeroutput>valgrind/*.h</computeroutput> headers
 that you may wish to include in your code (eg.
-<filename>valgrind.h</filename>, <filename>memcheck.h</filename>) are
+<filename>valgrind.h</filename>, <filename>memcheck.h</filename>,
+<filename>helgrind.h</filename>) are
 distributed under a BSD-style license, so you may include them in your
 code without worrying about license conflicts.  Some of the PThreads
 test cases, <filename>pth_*.c</filename>, are taken from "Pthreads
@@ -139,6 +166,13 @@ Programming" by Bradford Nichols, Dick Buttlar &amp; Jacqueline Proulx
 Farrell, ISBN 1-56592-115-1, published by O'Reilly &amp; Associates,
 Inc.</para>
 
+<para>If you contribute code to Valgrind, please ensure your
+contributions are licensed as "GPLv2, or (at your option) any later
+version."  This is so as to allow the possibility of easily upgrading
+the license to GPLv3 in future.  If you want to modify code in the VEX
+subdirectory, please also see VEX/HACKING.README.</para>
+
+
 </sect1>
 
 
@@ -158,11 +192,15 @@ want to run the Memcheck tool.  The final chapter explains how to write a
 new tool.</para>
 
 <para>Be aware that the core understands some command line flags, and
-the tools have their own flags which they know about.  This means there
-is no central place describing all the flags that are accepted -- you
-have to read the flags documentation both for 
+the tools have their own flags which they know about.  This means
+there is no central place describing all the flags that are
+accepted -- you have to read the flags documentation both for
 <xref linkend="manual-core"/> and for the tool you want to use.</para>
 
+<para>The manual is quite big and complex.  If you are looking for a
+quick getting-started guide, have a look at
+<xref linkend="quick-start"/>.</para>
+
 </sect1>
 
 </chapter>
diff --git a/docs/xml/quick-start-guide.xml b/docs/xml/quick-start-guide.xml
index 69655bdbf0..773871bb7e 100644
--- a/docs/xml/quick-start-guide.xml
+++ b/docs/xml/quick-start-guide.xml
@@ -32,24 +32,64 @@ memory errors such as:</para>
 
 <itemizedlist>
   <listitem>
-    <para>touching memory you shouldn't (eg. overrunning heap block
-    boundaries);</para>
+    <para>Touching memory you shouldn't (eg. overrunning heap block
+    boundaries, or reading/writing freed memory).</para>
   </listitem>
   <listitem>
-    <para>using values before they have been initialized;</para>
+    <para>Using values before they have been initialized.</para>
   </listitem>
   <listitem>
-    <para>incorrect freeing of memory, such as double-freeing heap
-    blocks;</para>
+    <para>Incorrect freeing of memory, such as double-freeing heap
+    blocks.</para>
   </listitem>
   <listitem>
-    <para>memory leaks.</para>
+    <para>Memory leaks.</para>
   </listitem>
 </itemizedlist>
 
+<para>Memcheck is only one of the tools in the Valgrind suite.
+Other tools you may find useful are:</para>
+
+<itemizedlist>
+  <listitem>
+    <para>Cachegrind: a profiling tool which produces detailed data on
+    cache (miss) and branch (misprediction) events.  Statistics are
+    gathered for the entire program, for each function, for each line
+    of code, and even for each instruction, if you need that level of
+    detail.</para>
+  </listitem>
+  <listitem>
+    <para>Callgrind: a heavyweight profiling tool similar to
+    Cachegrind, but which also shows cost relationships across
+    function calls.  Information gathered by Callgrind can be viewed
+    using the KCachegrind GUI.  KCachegrind is not part of the
+    Valgrind suite - it is part of the KDE Desktop Environment.</para>
+  </listitem>
+  <listitem>
+    <para>Massif: a space profiling tool.  It allows you to explore
+    in detail which parts of your program allocate memory.</para>
+  </listitem>
+  <listitem>
+    <para>Helgrind: a debugging tool for threaded programs.  Helgrind
+    looks for various kinds of synchronisation errors in code that uses
+    the POSIX pthreads API.</para>
+  </listitem>
+  <listitem>
+    <para>In addition, there are a number of "experimental" tools in
+    the codebase.  They can be distinguished by the "exp-" prefix on
+    their names.  Experimental tools are not subject to the same
+    quality control standards that apply to our production-grade tools
+    (Memcheck, Cachegrind, Callgrind, Massif and Helgrind).</para>
+  </listitem>
+</itemizedlist>
+
+<para>The rest of this guide discusses only the Memcheck tool.  For
+full documentation on the other tools, see the Valgrind User
+Manual.</para>
+
 <para>What follows is the minimum information you need to start
 detecting memory errors in your program with Memcheck.  Note that this
-guide applies to Valgrind version 2.4.0 and later.  Some of the
+guide applies to Valgrind version 3.3.0 and later.  Some of the
 information is not quite right for earlier versions.</para>
 
 </sect1>
@@ -162,8 +202,9 @@ Things to notice:
   </listitem>
 </itemizedlist>
 
-It's worth fixing errors in the order they are reported, as later errors
-can be caused by earlier errors.</para>
+It's worth fixing errors in the order they are reported, as later
+errors can be caused by earlier errors.  Failing to do this is a
+common cause of difficulty with Memcheck.</para>
 
 <para>Memory leak messages look like this:
 
@@ -219,6 +260,15 @@ that are allocated statically or on the stack.  But it should detect many
 errors that could crash your program (eg. cause a segmentation
 fault).</para>
 
+<para>Try to make your program so clean that Memcheck reports no
+errors.  Once you achieve this state, it is much easier to see when
+changes to the program cause Memcheck to report new errors.
+Experience from several years of Memcheck use shows that it is
+possible to make even huge programs run Memcheck-clean.  For example,
+large parts of KDE 3.5.X, and recent versions of OpenOffice.org
+(2.3.0) are Memcheck-clean, or very close to it.</para>
+
+
 </sect1>
 
 
diff --git a/docs/xml/tech-docs.xml b/docs/xml/tech-docs.xml
index 8615c1d807..552631331a 100644
--- a/docs/xml/tech-docs.xml
+++ b/docs/xml/tech-docs.xml
@@ -17,11 +17,14 @@
   </legalnotice>
 </bookinfo>
 
-  <xi:include href="../../memcheck/docs/mc-tech-docs.xml" parse="xml"  
+<!--  <xi:include href="../../memcheck/docs/mc-tech-docs.xml" parse="xml"  
       xmlns:xi="http://www.w3.org/2001/XInclude" />
-  <xi:include href="../../callgrind/docs/cl-format.xml" parse="xml"  
+-->
+  <xi:include href="new-tech-docs.xml" parse="xml"  
       xmlns:xi="http://www.w3.org/2001/XInclude" />
   <xi:include href="manual-writing-tools.xml" parse="xml"  
       xmlns:xi="http://www.w3.org/2001/XInclude" />
+  <xi:include href="../../callgrind/docs/cl-format.xml" parse="xml"  
+      xmlns:xi="http://www.w3.org/2001/XInclude" />
 
 </book>
diff --git a/docs/xml/vg-entities.xml b/docs/xml/vg-entities.xml
index 19f95e6127..d56e957a91 100644
--- a/docs/xml/vg-entities.xml
+++ b/docs/xml/vg-entities.xml
@@ -2,13 +2,13 @@
 <!ENTITY vg-url        "http://www.valgrind.org/">
 <!ENTITY vg-jemail     "julian@valgrind.org">
 <!ENTITY vg-vemail     "valgrind@valgrind.org">
-<!ENTITY vg-lifespan   "2000-2006">
+<!ENTITY vg-lifespan   "2000-2007">
 <!ENTITY vg-users-list "http://lists.sourceforge.net/lists/listinfo/valgrind-users">
 
 <!-- valgrind release + version stuff -->
 <!ENTITY rel-type    "Release">
-<!ENTITY rel-version "3.2.0">
-<!ENTITY rel-date    "7 June 2006">
+<!ENTITY rel-version "3.3.0">
+<!ENTITY rel-date    "7 December 2007">
 
 <!-- where the docs are installed -->
 <!ENTITY vg-doc-path  "/usr/share/doc/valgrind/html/index.html">
diff --git a/memcheck/docs/mc-manual.xml b/memcheck/docs/mc-manual.xml
index b8c36a7ce9..f8444c76a2 100644
--- a/memcheck/docs/mc-manual.xml
+++ b/memcheck/docs/mc-manual.xml
@@ -1287,6 +1287,393 @@ inform Memcheck about changes to the state of a mempool:</para>
 
 </itemizedlist>
 
+</sect1>
+
+
+
+
+
+
+
+<sect1 id="mc-manual.mpiwrap" xreflabel="MPI Wrappers">
+<title>Debugging MPI Parallel Programs with Valgrind</title>
+
+<para> Valgrind supports debugging of distributed-memory applications
+which use the MPI message passing standard.  This support consists of a
+library of wrapper functions for the
+<computeroutput>PMPI_*</computeroutput> interface.  When incorporated
+into the application's address space, either by direct linking or by
+<computeroutput>LD_PRELOAD</computeroutput>, the wrappers intercept
+calls to <computeroutput>PMPI_Send</computeroutput>,
+<computeroutput>PMPI_Recv</computeroutput>, etc.  They then
+use client requests to inform Valgrind of memory state changes caused
+by the function being wrapped.  This reduces the number of false
+positives that Memcheck otherwise typically reports for MPI
+applications.</para>
+
+<para>The wrappers also take the opportunity to carefully check
+size and definedness of buffers passed as arguments to MPI functions, hence
+detecting errors such as passing undefined data to
+<computeroutput>PMPI_Send</computeroutput>, or receiving data into a
+buffer which is too small.</para>
+
+<para>Unlike most of the rest of Valgrind, the wrapper library is subject to a
+BSD-style license, so you can link it into any code base you like.
+See the top of <computeroutput>auxprogs/libmpiwrap.c</computeroutput>
+for license details.</para>
+
+
+<sect2 id="mc-manual.mpiwrap.build" xreflabel="Building MPI Wrappers">
+<title>Building and installing the wrappers</title>
+
+<para> The wrapper library will be built automatically if possible.
+Valgrind's configure script will look for a suitable
+<computeroutput>mpicc</computeroutput> to build it with.  This must be
+the same <computeroutput>mpicc</computeroutput> you use to build the
+MPI application you want to debug.  By default, Valgrind tries
+<computeroutput>mpicc</computeroutput>, but you can specify a
+different one by using the configure-time flag
+<computeroutput>--with-mpicc=</computeroutput>.  Currently the
+wrappers are only buildable with
+<computeroutput>mpicc</computeroutput>s which are based on GNU
+<computeroutput>gcc</computeroutput> or Intel's
+<computeroutput>icc</computeroutput>.</para>
+
+<para>Check that the configure script prints a line like this:</para>
+
+<programlisting><![CDATA[
+checking for usable MPI2-compliant mpicc and mpi.h... yes, mpicc
+]]></programlisting>
+
+<para>If it says <computeroutput>... no</computeroutput>, your
+<computeroutput>mpicc</computeroutput> has failed to compile and link
+a test MPI2 program.</para>
+
+<para>If the configure test succeeds, continue in the usual way with
+<computeroutput>make</computeroutput> and <computeroutput>make
+install</computeroutput>.  The final install tree should then contain
+<computeroutput>libmpiwrap.so</computeroutput>.
+</para>
+
+<para>Compile up a test MPI program (eg, MPI hello-world) and try
+this:</para>
+
+<programlisting><![CDATA[
+LD_PRELOAD=$prefix/lib/valgrind/<platform>/libmpiwrap.so   \
+           mpirun [args] $prefix/bin/valgrind ./hello
+]]></programlisting>
+
+<para>You should see something similar to the following:</para>
+
+<programlisting><![CDATA[
+valgrind MPI wrappers 31901: Active for pid 31901
+valgrind MPI wrappers 31901: Try MPIWRAP_DEBUG=help for possible options
+]]></programlisting>
+
+<para>repeated for every process in the group.  If you do not see
+these, there is a build/installation problem of some kind.</para>
+
+<para> The MPI functions to be wrapped are assumed to be in an ELF
+shared object with soname matching
+<computeroutput>libmpi.so*</computeroutput>.  This is known to be
+correct at least for Open MPI and Quadrics MPI, and can easily be
+changed if required.</para> 
+</sect2>
+
+
+<sect2 id="mc-manual.mpiwrap.gettingstarted" 
+       xreflabel="Getting started with MPI Wrappers">
+<title>Getting started</title>
+
+<para>Compile your MPI application as usual, taking care to link it
+using the same <computeroutput>mpicc</computeroutput> that your
+Valgrind build was configured with.</para>
+
+<para>
+Use the following basic scheme to run your application on Valgrind with
+the wrappers engaged:</para>
+
+<programlisting><![CDATA[
+MPIWRAP_DEBUG=[wrapper-args]                                  \
+   LD_PRELOAD=$prefix/lib/valgrind/<platform>/libmpiwrap.so   \
+   mpirun [mpirun-args]                                       \
+   $prefix/bin/valgrind [valgrind-args]                       \
+   [application] [app-args]
+]]></programlisting>
+
+<para>As an alternative to
+<computeroutput>LD_PRELOAD</computeroutput>ing
+<computeroutput>libmpiwrap.so</computeroutput>, you can simply link it
+to your application if desired.  This should not disturb native
+behaviour of your application in any way.</para>
+</sect2>
+
+
+<sect2 id="mc-manual.mpiwrap.controlling" 
+       xreflabel="Controlling the MPI Wrappers">
+<title>Controlling the wrapper library</title>
+
+<para>Environment variable
+<computeroutput>MPIWRAP_DEBUG</computeroutput> is consulted at
+startup.  The default behaviour is to print a starting banner</para>
+
+<programlisting><![CDATA[
+valgrind MPI wrappers 16386: Active for pid 16386
+valgrind MPI wrappers 16386: Try MPIWRAP_DEBUG=help for possible options
+]]></programlisting>
+
+<para> and then be relatively quiet.</para>
+
+<para>You can give a list of comma-separated options in
+<computeroutput>MPIWRAP_DEBUG</computeroutput>.  These are</para>
+
+<itemizedlist>
+  <listitem>
+    <para><computeroutput>verbose</computeroutput>:
+    show entries/exits of all wrappers.  Also show extra
+    debugging info, such as the status of outstanding 
+    <computeroutput>MPI_Request</computeroutput>s resulting
+    from uncompleted <computeroutput>MPI_Irecv</computeroutput>s.</para>
+  </listitem>
+  <listitem>
+    <para><computeroutput>quiet</computeroutput>: 
+    opposite of <computeroutput>verbose</computeroutput>, only print 
+    anything when the wrappers want
+    to report a detected programming error, or in case of catastrophic
+    failure of the wrappers.</para>
+  </listitem>
+  <listitem>
+    <para><computeroutput>warn</computeroutput>: 
+    by default, functions which lack proper wrappers
+    are not commented on, just silently
+    ignored.  This option causes a warning to be printed for each unwrapped
+    function used, up to a maximum of three warnings per function.</para>
+  </listitem>
+  <listitem>
+    <para><computeroutput>strict</computeroutput>: 
+    print an error message and abort the program if 
+    a function lacking a wrapper is used.</para>
+  </listitem>
+</itemizedlist>
+
+<para> If you want to use Valgrind's XML output facility
+(<computeroutput>--xml=yes</computeroutput>), you should pass
+<computeroutput>quiet</computeroutput> in
+<computeroutput>MPIWRAP_DEBUG</computeroutput> so as to get rid of any
+extraneous printing from the wrappers.</para>
+
+</sect2>
+
+
+<sect2 id="mc-manual.mpiwrap.limitations" 
+       xreflabel="Abilities and Limitations of MPI Wrappers">
+<title>Abilities and limitations</title>
+
+<sect3 id="mc-manual.mpiwrap.limitations.functions" 
+       xreflabel="Functions">
+<title>Functions</title>
+
+<para>All MPI2 functions except
+<computeroutput>MPI_Wtick</computeroutput>,
+<computeroutput>MPI_Wtime</computeroutput> and
+<computeroutput>MPI_Pcontrol</computeroutput> have wrappers.  The
+first two are not wrapped because they return a 
+<computeroutput>double</computeroutput>, and Valgrind's
+function-wrap mechanism cannot handle that (it could easily enough be
+extended to).  <computeroutput>MPI_Pcontrol</computeroutput> cannot be
+wrapped as it has variable arity: 
+<computeroutput>int MPI_Pcontrol(const int level, ...)</computeroutput></para>
+
+<para>Most functions are wrapped with a default wrapper which does
+nothing except complain or abort if it is called, depending on
+settings in <computeroutput>MPIWRAP_DEBUG</computeroutput> listed
+above.  The following functions have "real", do-something-useful
+wrappers:</para>
+
+<programlisting><![CDATA[
+PMPI_Send PMPI_Bsend PMPI_Ssend PMPI_Rsend
+
+PMPI_Recv PMPI_Get_count
+
+PMPI_Isend PMPI_Ibsend PMPI_Issend PMPI_Irsend
+
+PMPI_Irecv
+PMPI_Wait PMPI_Waitall
+PMPI_Test PMPI_Testall
+
+PMPI_Iprobe PMPI_Probe
+
+PMPI_Cancel
+
+PMPI_Sendrecv
+
+PMPI_Type_commit PMPI_Type_free
+
+PMPI_Pack PMPI_Unpack
+
+PMPI_Bcast PMPI_Gather PMPI_Scatter PMPI_Alltoall
+PMPI_Reduce PMPI_Allreduce PMPI_Op_create
+
+PMPI_Comm_create PMPI_Comm_dup PMPI_Comm_free PMPI_Comm_rank PMPI_Comm_size
+
+PMPI_Error_string
+PMPI_Init PMPI_Initialized PMPI_Finalize
+]]></programlisting>
+
+<para> A few functions such as
+<computeroutput>PMPI_Address</computeroutput> are listed as
+<computeroutput>HAS_NO_WRAPPER</computeroutput>.  They have no wrapper
+at all as there is nothing worth checking, and giving a no-op wrapper
+would reduce performance for no reason.</para>
+
+<para> Note that the wrapper library can itself generate large
+numbers of calls to the MPI implementation, especially when walking
+complex types.  The most common functions called are
+<computeroutput>PMPI_Type_extent</computeroutput>,
+<computeroutput>PMPI_Type_get_envelope</computeroutput>,
+<computeroutput>PMPI_Type_get_contents</computeroutput>, and
+<computeroutput>PMPI_Type_free</computeroutput>.  </para>
+</sect3>
+
+<sect3 id="mc-manual.mpiwrap.limitations.types" 
+       xreflabel="Types">
+<title>Types</title>
+
+<para> MPI-1.1 structured types are supported, and walked exactly.
+The currently supported combiners are
+<computeroutput>MPI_COMBINER_NAMED</computeroutput>,
+<computeroutput>MPI_COMBINER_CONTIGUOUS</computeroutput>,
+<computeroutput>MPI_COMBINER_VECTOR</computeroutput>,
+<computeroutput>MPI_COMBINER_HVECTOR</computeroutput>,
+<computeroutput>MPI_COMBINER_INDEXED</computeroutput>,
+<computeroutput>MPI_COMBINER_HINDEXED</computeroutput> and
+<computeroutput>MPI_COMBINER_STRUCT</computeroutput>.  This should
+cover all MPI-1.1 types.  The mechanism (function
+<computeroutput>walk_type</computeroutput>) should extend easily to
+cover MPI2 combiners.</para>
+
+<para>MPI defines some named structured types
+(<computeroutput>MPI_FLOAT_INT</computeroutput>,
+<computeroutput>MPI_DOUBLE_INT</computeroutput>,
+<computeroutput>MPI_LONG_INT</computeroutput>,
+<computeroutput>MPI_2INT</computeroutput>,
+<computeroutput>MPI_SHORT_INT</computeroutput>,
+<computeroutput>MPI_LONG_DOUBLE_INT</computeroutput>) which are pairs
+of some basic type and a C <computeroutput>int</computeroutput>.
+Unfortunately the MPI specification makes it impossible to look inside
+these types and see where the fields are.  Therefore these wrappers
+assume the types are laid out as <computeroutput>struct { float val;
+int loc; }</computeroutput> (for
+<computeroutput>MPI_FLOAT_INT</computeroutput>), etc, and act
+accordingly.  This appears to be correct at least for Open MPI 1.0.2
+and for Quadrics MPI.</para>
+
+<para>If <computeroutput>strict</computeroutput> is an option specified 
+in <computeroutput>MPIWRAP_DEBUG</computeroutput>, the application
+will abort if an unhandled type is encountered.  Otherwise, the 
+application will print a warning message and continue.</para>
+
+<para>Some effort is made to mark/check memory ranges corresponding to
+arrays of values in a single pass.  This is important for performance
+since asking Valgrind to mark/check any range, no matter how small,
+carries quite a large constant cost.  This optimisation is applied to
+arrays of primitive types (<computeroutput>double</computeroutput>,
+<computeroutput>float</computeroutput>,
+<computeroutput>int</computeroutput>,
+<computeroutput>long</computeroutput>, <computeroutput>long
+long</computeroutput>, <computeroutput>short</computeroutput>,
+<computeroutput>char</computeroutput>, and <computeroutput>long
+double</computeroutput> on platforms where <computeroutput>sizeof(long
+double) == 8</computeroutput>).  For arrays of all other types, the
+wrappers handle each element individually and so there can be a very
+large performance cost.</para>
+
+</sect3>
+
+</sect2>
+
+
+<sect2 id="mc-manual.mpiwrap.writingwrappers" 
+       xreflabel="Writing new MPI Wrappers">
+<title>Writing new wrappers</title>
+
+<para>
+For the most part the wrappers are straightforward.  The only
+significant complexity arises with nonblocking receives.</para>
+
+<para>The issue is that <computeroutput>MPI_Irecv</computeroutput>
+specifies the recv buffer and returns immediately, giving a handle
+(<computeroutput>MPI_Request</computeroutput>) for the transaction.
+Later the user will have to poll for completion with
+<computeroutput>MPI_Wait</computeroutput> etc, and when the
+transaction completes successfully, the wrappers have to mark the
+recv buffer as initialised.  But the buffer details are not presented to
+<computeroutput>MPI_Wait</computeroutput> -- only the handle is.  The
+library therefore maintains a shadow table which associates
+uncompleted <computeroutput>MPI_Request</computeroutput>s with the
+corresponding buffer address/count/type.  When an operation completes,
+the table is searched for the associated address/count/type info, and
+memory is marked accordingly.</para>
+
+<para>Access to the table is guarded by a (POSIX pthreads) lock, so as
+to make the library thread-safe.</para>
+
+<para>The table is allocated with
+<computeroutput>malloc</computeroutput> and never
+<computeroutput>free</computeroutput>d, so it will show up in leak
+checks.</para>
+
+<para>Writing new wrappers should be fairly easy.  The source file is
+<computeroutput>auxprogs/libmpiwrap.c</computeroutput>.  If possible,
+find an existing wrapper for a function of similar behaviour to the
+one you want to wrap, and use it as a starting point.  The wrappers
+are organised in sections in the same order as the MPI 1.1 spec, to
+aid navigation.  When adding a wrapper, remember to comment out the
+definition of the default wrapper in the long list of defaults at the
+bottom of the file (do not remove it, just comment it out).</para>
+</sect2>
+
+<sect2 id="mc-manual.mpiwrap.whattoexpect" 
+       xreflabel="What to expect with MPI Wrappers">
+<title>What to expect when using the wrappers</title>
+
+<para>The wrappers should reduce Memcheck's false-error rate on MPI
+applications.  Because the wrapping is done at the MPI interface,
+there will still potentially be a large number of errors reported in
+the MPI implementation below the interface.  The best you can do is
+try to suppress them.</para>
+
+<para>You may also find that the input-side (buffer
+length/definedness) checks find errors in your MPI use, for example
+passing too short a buffer to
+<computeroutput>MPI_Recv</computeroutput>.</para>
+
+<para>Functions which are not wrapped may increase the false
+error rate.  A possible approach is to run with
+<computeroutput>MPIWRAP_DEBUG</computeroutput> containing
+<computeroutput>warn</computeroutput>.  This will show you functions
+which lack proper wrappers but which are nevertheless used.  You can
+then write wrappers for them.
+</para>
+
+<para>A known source of potential false errors is the
+<computeroutput>PMPI_Reduce</computeroutput> family of functions, when
+using a custom (user-defined) reduction function.  In a reduction
+operation, each node notionally sends data to a "central point" which
+uses the specified reduction function to merge the data items into a
+single item.  Hence, in general, data is passed between nodes and fed
+to the reduction function, but the wrapper library cannot mark the
+transferred data as initialised before it is handed to the reduction
+function, because all that happens "inside" the
+<computeroutput>PMPI_Reduce</computeroutput> call.  As a result you
+may see false positives reported in your reduction function.</para>
+
+</sect2>
 
 </sect1>
+
+
+
+
+
 </chapter>