</sect1>
+
+<sect1 id="manual-core.mpiwrap" xreflabel="MPI Wrappers">
+<title>Debugging MPI Parallel Programs with Valgrind</title>
+
+<para> Valgrind supports debugging of distributed-memory applications
+which use the MPI message passing standard. This support consists of a
+library of wrapper functions for the
+<computeroutput>PMPI_*</computeroutput> interface. When incorporated
+into the application's address space, either by direct linking or by
+<computeroutput>LD_PRELOAD</computeroutput>, the wrappers intercept
+calls to <computeroutput>PMPI_Send</computeroutput>,
+<computeroutput>PMPI_Recv</computeroutput>, etc. They then
+use client requests to inform Valgrind of memory state changes caused
+by the function being wrapped. This reduces the number of false
+positives that Memcheck otherwise typically reports for MPI
+applications.</para>
+
+<para>The wrappers also take the opportunity to carefully check
+size and definedness of buffers passed as arguments to MPI functions, hence
+detecting errors such as passing undefined data to
+<computeroutput>PMPI_Send</computeroutput>, or receiving data into a
+buffer which is too small.</para>
+
+
+<sect2 id="manual-core.mpiwrap.build" xreflabel="Building MPI Wrappers">
+<title>Building and installing the wrappers</title>
+
+<para> The wrapper library will be built automatically if possible.
+Valgrind's configure script will look for a suitable
+<computeroutput>mpicc</computeroutput> to build it with. This must be
+the same <computeroutput>mpicc</computeroutput> you use to build the
+MPI application you want to debug. By default, Valgrind tries
+<computeroutput>mpicc</computeroutput>, but you can specify a
+different one by using the configure-time flag
+<computeroutput>--with-mpicc=</computeroutput>. Currently the
+wrappers are only buildable with
+<computeroutput>mpicc</computeroutput>s which are based on GNU
+<computeroutput>gcc</computeroutput> or Intel's
+<computeroutput>icc</computeroutput>.</para>
+
+<para>Check that the configure script prints a line like this:</para>
+
+<programlisting><![CDATA[
+checking for usable MPI2-compliant mpicc and mpi.h... yes, mpicc
+]]></programlisting>
+
+<para>If it says <computeroutput>... no</computeroutput>, your
+<computeroutput>mpicc</computeroutput> has failed to compile and link
+a test MPI2 program.</para>
+
+<para>If the configure test succeeds, continue in the usual way with
+<computeroutput>make</computeroutput> and <computeroutput>make
+install</computeroutput>. The final install tree should then contain
+<computeroutput>libmpiwrap.so</computeroutput>.
+</para>
+
+<para>Compile up a test MPI program (eg, MPI hello-world) and try
+this:</para>
+
+<programlisting><![CDATA[
+LD_PRELOAD=$prefix/lib/valgrind/<platform>/libmpiwrap.so \
+ mpirun [args] $prefix/bin/valgrind ./hello
+]]></programlisting>
+
+<para>You should see something similar to the following:</para>
+
+<programlisting><![CDATA[
+valgrind MPI wrappers 31901: Active for pid 31901
+valgrind MPI wrappers 31901: Try MPIWRAP_DEBUG=help for possible options
+]]></programlisting>
+
+<para>repeated for every process in the group. If you do not see
+these, there is a build/installation problem of some kind.</para>
+
+<para> The MPI functions to be wrapped are assumed to be in an ELF
+shared object with soname matching
+<computeroutput>libmpi.so*</computeroutput>. This is known to be
+correct at least for Open MPI and Quadrics MPI, and can easily be
+changed if required.</para>
+</sect2>
+
+
+<sect2 id="manual-core.mpiwrap.gettingstarted"
+ xreflabel="Getting started with MPI Wrappers">
+<title>Getting started</title>
+
+<para>Compile your MPI application as usual, taking care to link it
+using the same <computeroutput>mpicc</computeroutput> that your
+Valgrind build was configured with.</para>
+
+<para>
+Use the following basic scheme to run your application on Valgrind with
+the wrappers engaged:</para>
+
+<programlisting><![CDATA[
+MPIWRAP_DEBUG=[wrapper-args] \
+ LD_PRELOAD=$prefix/lib/valgrind/<platform>/libmpiwrap.so \
+ mpirun [mpirun-args] \
+ $prefix/bin/valgrind [valgrind-args] \
+ [application] [app-args]
+]]></programlisting>
+
+<para>As an alternative to
+<computeroutput>LD_PRELOAD</computeroutput>ing
+<computeroutput>libmpiwrap.so</computeroutput>, you can simply link it
+to your application if desired. This should not disturb native
+behaviour of your application in any way.</para>
+</sect2>
+
+
+<sect2 id="manual-core.mpiwrap.controlling"
+ xreflabel="Controlling the MPI Wrappers">
+<title>Controlling the wrapper library</title>
+
+<para>Environment variable
+<computeroutput>MPIWRAP_DEBUG</computeroutput> is consulted at
+startup. The default behaviour is to print a starting banner</para>
+
+<programlisting><![CDATA[
+valgrind MPI wrappers 16386: Active for pid 16386
+valgrind MPI wrappers 16386: Try MPIWRAP_DEBUG=help for possible options
+]]></programlisting>
+
+<para> and then be relatively quiet.</para>
+
+<para>You can give a list of comma-separated options in
+<computeroutput>MPIWRAP_DEBUG</computeroutput>. These are</para>
+
+<itemizedlist>
+ <listitem>
+ <para><computeroutput>verbose</computeroutput>:
+ show entries/exits of all wrappers. Also show extra
+ debugging info, such as the status of outstanding
+ <computeroutput>MPI_Request</computeroutput>s resulting
+ from uncompleted <computeroutput>MPI_Irecv</computeroutput>s.</para>
+ </listitem>
+ <listitem>
+ <para><computeroutput>quiet</computeroutput>:
+ opposite of <computeroutput>verbose</computeroutput>, only print
+ anything when the wrappers want
+ to report a detected programming error, or in case of catastrophic
+ failure of the wrappers.</para>
+ </listitem>
+ <listitem>
+    <para><computeroutput>warn</computeroutput>:
+    by default, functions which lack proper wrappers
+    are not commented on, just silently
+    ignored. This option causes a warning to be printed for each unwrapped
+    function used, up to a maximum of three warnings per function.</para>
+ </listitem>
+ <listitem>
+ <para><computeroutput>strict</computeroutput>:
+ print an error message and abort the program if
+ a function lacking a wrapper is used.</para>
+ </listitem>
+</itemizedlist>
+
+<para> If you want to use Valgrind's XML output facility
+(<computeroutput>--xml=yes</computeroutput>), you should pass
+<computeroutput>quiet</computeroutput> in
+<computeroutput>MPIWRAP_DEBUG</computeroutput> so as to get rid of any
+extraneous printing from the wrappers.</para>
+
+</sect2>
+
+
+<sect2 id="manual-core.mpiwrap.limitations"
+ xreflabel="Abilities and Limitations of MPI Wrappers">
+<title>Abilities and limitations</title>
+
+<sect3>
+<title>Functions</title>
+
+<para>All MPI2 functions except
+<computeroutput>MPI_Wtick</computeroutput>,
+<computeroutput>MPI_Wtime</computeroutput> and
+<computeroutput>MPI_Pcontrol</computeroutput> have wrappers. The
+first two are not wrapped because they return a
+<computeroutput>double</computeroutput>, which Valgrind's
+function-wrap mechanism cannot handle (though it could easily be
+extended to do so). <computeroutput>MPI_Pcontrol</computeroutput> cannot be
+wrapped as it has variable arity:
+<computeroutput>int MPI_Pcontrol(const int level, ...)</computeroutput></para>
+
+<para>Most functions are wrapped with a default wrapper which does
+nothing except complain or abort if it is called, depending on
+settings in <computeroutput>MPIWRAP_DEBUG</computeroutput> listed
+above. The following functions have "real", do-something-useful
+wrappers:</para>
+
+<programlisting><![CDATA[
+PMPI_Send PMPI_Bsend PMPI_Ssend PMPI_Rsend
+
+PMPI_Recv PMPI_Get_count
+
+PMPI_Isend PMPI_Ibsend PMPI_Issend PMPI_Irsend
+
+PMPI_Irecv
+PMPI_Wait PMPI_Waitall
+PMPI_Test PMPI_Testall
+
+PMPI_Iprobe PMPI_Probe
+
+PMPI_Cancel
+
+PMPI_Sendrecv
+
+PMPI_Type_commit PMPI_Type_free
+
+PMPI_Bcast PMPI_Gather PMPI_Scatter PMPI_Alltoall
+PMPI_Reduce PMPI_Allreduce PMPI_Op_create
+
+PMPI_Comm_create PMPI_Comm_dup PMPI_Comm_free PMPI_Comm_rank PMPI_Comm_size
+
+PMPI_Error_string
+PMPI_Init PMPI_Initialized PMPI_Finalize
+]]></programlisting>
+
+<para> A few functions such as
+<computeroutput>PMPI_Address</computeroutput> are listed as
+<computeroutput>HAS_NO_WRAPPER</computeroutput>. They have no wrapper
+at all as there is nothing worth checking, and giving a no-op wrapper
+would reduce performance for no reason.</para>
+
+<para> Note that the wrapper library can itself generate large
+numbers of calls to the MPI implementation, especially when walking
+complex types. The most common functions called are
+<computeroutput>PMPI_Extent</computeroutput>,
+<computeroutput>PMPI_Type_get_envelope</computeroutput>,
+<computeroutput>PMPI_Type_get_contents</computeroutput>, and
+<computeroutput>PMPI_Type_free</computeroutput>. </para>
+</sect3>
+
+<sect3>
+<title>Types</title>
+
+<para> MPI-1.1 structured types are supported, and walked exactly.
+The currently supported combiners are
+<computeroutput>MPI_COMBINER_NAMED</computeroutput>,
+<computeroutput>MPI_COMBINER_CONTIGUOUS</computeroutput>,
+<computeroutput>MPI_COMBINER_VECTOR</computeroutput>,
+<computeroutput>MPI_COMBINER_HVECTOR</computeroutput>,
+<computeroutput>MPI_COMBINER_INDEXED</computeroutput>,
+<computeroutput>MPI_COMBINER_HINDEXED</computeroutput> and
+<computeroutput>MPI_COMBINER_STRUCT</computeroutput>. This should
+cover all MPI-1.1 types. The mechanism (function
+<computeroutput>walk_type</computeroutput>) should extend easily to
+cover MPI2 combiners.</para>
+
+<para>MPI defines some named structured types
+(<computeroutput>MPI_FLOAT_INT</computeroutput>,
+<computeroutput>MPI_DOUBLE_INT</computeroutput>,
+<computeroutput>MPI_LONG_INT</computeroutput>,
+<computeroutput>MPI_2INT</computeroutput>,
+<computeroutput>MPI_SHORT_INT</computeroutput>,
+<computeroutput>MPI_LONG_DOUBLE_INT</computeroutput>) which are pairs
+of some basic type and a C <computeroutput>int</computeroutput>.
+Unfortunately the MPI specification makes it impossible to look inside
+these types and see where the fields are. Therefore these wrappers
+assume the types are laid out as <computeroutput>struct { float val;
+int loc; }</computeroutput> (for
+<computeroutput>MPI_FLOAT_INT</computeroutput>), etc, and act
+accordingly. This appears to be correct at least for Open MPI 1.0.2
+and for Quadrics MPI.</para>
+
+<para>If <computeroutput>strict</computeroutput> is an option specified
+in <computeroutput>MPIWRAP_DEBUG</computeroutput>, the application
+will abort if an unhandled type is encountered. Otherwise, the
+application will print a warning message and continue.</para>
+
+<para>Some effort is made to mark/check memory ranges corresponding to
+arrays of values in a single pass. This is important for performance
+since asking Valgrind to mark/check any range, no matter how small,
+carries quite a large constant cost. This optimisation is applied to
+arrays of primitive types (<computeroutput>double</computeroutput>,
+<computeroutput>float</computeroutput>,
+<computeroutput>int</computeroutput>,
+<computeroutput>long</computeroutput>, <computeroutput>long
+long</computeroutput>, <computeroutput>short</computeroutput>,
+<computeroutput>char</computeroutput>, and <computeroutput>long
+double</computeroutput> on platforms where <computeroutput>sizeof(long
+double) == 8</computeroutput>). For arrays of all other types, the
+wrappers handle each element individually and so there can be a very
+large performance cost.</para>
+
+</sect3>
+
+</sect2>
+
+
+<sect2 id="manual-core.mpiwrap.writingwrappers"
+ xreflabel="Writing new MPI Wrappers">
+<title>Writing new wrappers</title>
+
+<para>
+For the most part the wrappers are straightforward. The only
+significant complexity arises with nonblocking receives.</para>
+
+<para>The issue is that <computeroutput>MPI_Irecv</computeroutput>
+specifies the recv buffer and returns immediately, giving a handle
+(<computeroutput>MPI_Request</computeroutput>) for the transaction.
+Later the user will have to poll for completion with
+<computeroutput>MPI_Wait</computeroutput> etc, and when the
+transaction completes successfully, the wrappers have to paint the
+recv buffer. But the recv buffer details are not presented to
+<computeroutput>MPI_Wait</computeroutput> -- only the handle is. The
+library therefore maintains a shadow table which associates
+uncompleted <computeroutput>MPI_Request</computeroutput>s with the
+corresponding buffer address/count/type. When an operation completes,
+the table is searched for the associated address/count/type info, and
+memory is marked accordingly.</para>
+
+<para>Access to the table is guarded by a (POSIX pthreads) lock, so as
+to make the library thread-safe.</para>
+
+<para>The table is allocated with
+<computeroutput>malloc</computeroutput> and never
+<computeroutput>free</computeroutput>d, so it will show up in leak
+checks.</para>
+
+<para>Writing new wrappers should be fairly easy. The source file is
+<computeroutput>auxprogs/libmpiwrap.c</computeroutput>. If possible,
+find an existing wrapper for a function of similar behaviour to the
+one you want to wrap, and use it as a starting point. The wrappers
+are organised in sections in the same order as the MPI 1.1 spec, to
+aid navigation. When adding a wrapper, remember to comment out the
+definition of the default wrapper in the long list of defaults at the
+bottom of the file (do not remove it, just comment it out).</para>
+</sect2>
+
+<sect2 id="manual-core.mpiwrap.whattoexpect"
+ xreflabel="What to expect with MPI Wrappers">
+<title>What to expect when using the wrappers</title>
+
+<para>The wrappers should reduce Memcheck's false-error rate on MPI
+applications. Because the wrapping is done at the MPI interface,
+there will still potentially be a large number of errors reported in
+the MPI implementation below the interface. The best you can do is
+try to suppress them.</para>
+
+<para>You may also find that the input-side (buffer
+length/definedness) checks find errors in your MPI use, for example
+passing too short a buffer to
+<computeroutput>MPI_Recv</computeroutput>.</para>
+
+<para>Functions which are not wrapped may increase the false
+error rate. A possible approach is to run with
+<computeroutput>MPIWRAP_DEBUG</computeroutput> containing
+<computeroutput>warn</computeroutput>. This will show you functions
+which lack proper wrappers but which are nevertheless used. You can
+then write wrappers for them.
+</para>
+
+</sect2>
+
+</sect1>
+
</chapter>