<chapter id="cg-manual" xreflabel="Cachegrind: a cache-miss profiler">
<title>Cachegrind: a cache profiler</title>
-<para>Detailed technical documentation on how Cachegrind works is
-available in <xref linkend="cg-tech-docs"/>. If you only want to know
-how to <command>use</command> it, this is the page you need to
-read.</para>
-
-
<sect1 id="cg-manual.cache" xreflabel="Cache profiling">
<title>Cache profiling</title>
</sect2>
+</sect1>
+
+<sect1 id="cg-manual.impl-details" xreflabel="Implementation details">
+<title>Implementation details</title>
+<para>This section covers details that you don't need to know in order to
+use Cachegrind, but that may be of interest to some people.</para>
<sect2>
-<title>Todo</title>
+<title>How Cachegrind works</title>
+<para>The best reference for understanding how Cachegrind works is chapter 3 of
+"Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It
+is available on the publications page of the Valgrind website.</para>
+</sect2>
+<sect2>
+<title>Cachegrind output file format</title>
+<para>The file format is fairly straightforward, basically giving the
+cost centre for every line, grouped by files and
+functions. Total counts (eg. total cache accesses, total L1
+misses) are calculated when traversing this structure rather than
+during execution, to save time; the cache simulation functions
+are called so often that even one or two extra adds can make a
+sizeable difference.</para>
+
+<para>The file format:</para>
+<programlisting><![CDATA[
+file ::= desc_line* cmd_line events_line data_line+ summary_line
+desc_line ::= "desc:" ws? non_nl_string
+cmd_line ::= "cmd:" ws? cmd
+events_line ::= "events:" ws? (event ws)+
+data_line ::= file_line | fn_line | count_line
+file_line ::= "fl=" filename
+fn_line ::= "fn=" fn_name
+count_line ::= line_num ws? (count ws)+
+summary_line ::= "summary:" ws? (count ws)+
+count ::= num | "."]]></programlisting>
+
+<para>Where:</para>
<itemizedlist>
<listitem>
- <para>Program start-up/shut-down calls a lot of functions
- that aren't interesting and just complicate the output.
- Would be nice to exclude these somehow.</para>
+ <para><computeroutput>non_nl_string</computeroutput> is any
+ string not containing a newline.</para>
+ </listitem>
+ <listitem>
+ <para><computeroutput>cmd</computeroutput> is a string holding the
+ command line of the profiled program.</para>
+ </listitem>
+ <listitem>
+ <para><computeroutput>filename</computeroutput> and
+ <computeroutput>fn_name</computeroutput> are strings.</para>
</listitem>
-</itemizedlist>
+ <listitem>
+ <para><computeroutput>num</computeroutput> and
+ <computeroutput>line_num</computeroutput> are decimal
+ numbers.</para>
+ </listitem>
+ <listitem>
+ <para><computeroutput>ws</computeroutput> is whitespace.</para>
+ </listitem>
+</itemizedlist>
+
+<para>The contents of the "desc:" lines are printed out at the top
+of the summary. This is a generic way of providing
+simulation-specific information, eg. giving the cache configuration
+used for the cache simulation.</para>
+
+<para>More than one <computeroutput>count_line</computeroutput> can be
+presented for a given file/fn/line number. In such cases, the counts for the
+named events are accumulated.</para>
+
+<para>Counts can be "." to represent zero. This makes the files easier to
+read.</para>
+
+<para>The number of counts in each
+<computeroutput>count_line</computeroutput> and in the
+<computeroutput>summary_line</computeroutput> should not exceed
+the number of events in the
+<computeroutput>events_line</computeroutput>. If a
+<computeroutput>count_line</computeroutput> has fewer counts,
+cg_annotate treats the missing ones as "." entries.</para>
+
+<para>A <computeroutput>file_line</computeroutput> ("fl=") changes the
+current file name. A <computeroutput>fn_line</computeroutput> ("fn=")
+changes the current function name. A
+<computeroutput>count_line</computeroutput> contains counts that
+pertain to the current filename/fn_name. A
+<computeroutput>file_line</computeroutput> and a
+<computeroutput>fn_line</computeroutput> must appear before any
+<computeroutput>count_line</computeroutput>s to give the context
+of the first <computeroutput>count_line</computeroutput>s.</para>
+
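+<para>As a concrete illustration, here is a small, entirely hypothetical
+output file that follows the grammar above. The event names, cache
+descriptions, file name and counts are made up for the example; the summary
+counts are simply the column totals of the count lines:</para>
+
+<programlisting><![CDATA[
+desc: I1 cache: 65536 B, 64 B, 2-way associative
+desc: D1 cache: 65536 B, 64 B, 2-way associative
+desc: L2 cache: 262144 B, 64 B, 8-way associative
+cmd: ./myprog arg1
+events: Ir I1mr I2mr
+fl=myprog.c
+fn=main
+5 100 2 1
+6 400 . .
+fn=helper
+12 300 1 .
+summary: 800 3 1]]></programlisting>
+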
+<para>Each <computeroutput>file_line</computeroutput> will normally be
+immediately followed by a <computeroutput>fn_line</computeroutput>,
+but this is not required.</para>
+
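+<para>To make the accumulation rules concrete, the following is a minimal
+sketch (in Python, and not part of Cachegrind or cg_annotate) of how a
+consumer of this format might read a file: it tracks the current
+"fl="/"fn=" context, treats missing counts on short count lines as ".",
+accumulates repeated file/fn/line entries, and computes the totals only
+when traversing the collected data, as described above.</para>
+
+<programlisting><![CDATA[
+#!/usr/bin/env python3
+# Minimal sketch of a reader for the format described above.
+# Not cg_annotate itself; just an illustration of the accumulation rules.
+import sys
+
+def count(tok):
+    return 0 if tok == "." else int(tok)      # "." stands for zero
+
+def read_profile(path):
+    events = []
+    counts = {}                                # (file, fn, line) -> [counts]
+    cur_file = cur_fn = None
+    for line in open(path):
+        line = line.rstrip("\n")
+        if line.startswith(("desc:", "cmd:", "summary:")) or not line:
+            continue                           # summary is recomputed below
+        if line.startswith("events:"):
+            events = line.split()[1:]
+        elif line.startswith("fl="):
+            cur_file = line[3:]
+        elif line.startswith("fn="):
+            cur_fn = line[3:]
+        else:                                  # a count_line
+            toks = line.split()
+            key = (cur_file, cur_fn, int(toks[0]))
+            cs = [count(t) for t in toks[1:]]
+            cs += [0] * (len(events) - len(cs))   # short line: missing = "."
+            old = counts.get(key, [0] * len(events))
+            counts[key] = [a + b for a, b in zip(old, cs)]  # accumulate
+    return events, counts
+
+if __name__ == "__main__":
+    events, counts = read_profile(sys.argv[1])
+    totals = [0] * len(events)                 # totals computed at traversal
+    for cs in counts.values():
+        totals = [a + b for a, b in zip(totals, cs)]
+    print("summary:", " ".join(str(t) for t in totals))]]></programlisting>
+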
</sect2>
+++ /dev/null
-<?xml version="1.0"?> <!-- -*- sgml -*- -->
-<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
- "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
-
-<chapter id="cg-tech-docs" xreflabel="How Cachegrind works">
-
-<title>How Cachegrind works</title>
-
-<sect1 id="cg-tech-docs.profiling" xreflabel="Cache profiling">
-<title>Cache profiling</title>
-
-<para>[Note: this document is now very old, and a lot of its contents are out
-of date, and misleading.]</para>
-
-<para>Valgrind is a very nice platform for doing cache profiling
-and other kinds of simulation, because it converts horrible x86
-instructions into nice clean RISC-like UCode. For example, for
-cache profiling we are interested in instructions that read and
-write memory; in UCode there are only four instructions that do
-this: <computeroutput>LOAD</computeroutput>,
-<computeroutput>STORE</computeroutput>,
-<computeroutput>FPU_R</computeroutput> and
-<computeroutput>FPU_W</computeroutput>. By contrast, because of
-the x86 addressing modes, almost every instruction can read or
-write memory.</para>
-
-<para>Most of the cache profiling machinery is in the file
-<filename>vg_cachesim.c</filename>.</para>
-
-<para>These notes are a somewhat haphazard guide to how
-Valgrind's cache profiling works.</para>
-
-</sect1>
-
-
-<sect1 id="cg-tech-docs.costcentres" xreflabel="Cost centres">
-<title>Cost centres</title>
-
-<para>Valgrind gathers cache profiling about every instruction
-executed, individually. Each instruction has a <command>cost
-centre</command> associated with it. There are two kinds of cost
-centre: one for instructions that don't reference memory
-(<computeroutput>iCC</computeroutput>), and one for instructions
-that do (<computeroutput>idCC</computeroutput>):</para>
-
-<programlisting><![CDATA[
-typedef struct _CC {
- ULong a;
- ULong m1;
- ULong m2;
-} CC;
-
-typedef struct _iCC {
- /* word 1 */
- UChar tag;
- UChar instr_size;
-
- /* words 2+ */
- Addr instr_addr;
- CC I;
-} iCC;
-
-typedef struct _idCC {
- /* word 1 */
- UChar tag;
- UChar instr_size;
- UChar data_size;
-
- /* words 2+ */
- Addr instr_addr;
- CC I;
- CC D;
-} idCC; ]]></programlisting>
-
-<para>Each <computeroutput>CC</computeroutput> has three fields
-<computeroutput>a</computeroutput>,
-<computeroutput>m1</computeroutput>,
-<computeroutput>m2</computeroutput> for recording references,
-level 1 misses and level 2 misses. Each of these is a 64-bit
-<computeroutput>ULong</computeroutput> -- the numbers can get
-very large, ie. greater than 4.2 billion allowed by a 32-bit
-unsigned int.</para>
-
-<para>A <computeroutput>iCC</computeroutput> has one
-<computeroutput>CC</computeroutput> for instruction cache
-accesses. A <computeroutput>idCC</computeroutput> has two, one
-for instruction cache accesses, and one for data cache
-accesses.</para>
-
-<para>The <computeroutput>iCC</computeroutput> and
-<computeroutput>dCC</computeroutput> structs also store
-unchanging information about the instruction:</para>
-<itemizedlist>
- <listitem>
- <para>An instruction-type identification tag (explained
- below)</para>
- </listitem>
- <listitem>
- <para>Instruction size</para>
- </listitem>
- <listitem>
- <para>Data reference size
- (<computeroutput>idCC</computeroutput> only)</para>
- </listitem>
- <listitem>
- <para>Instruction address</para>
- </listitem>
-</itemizedlist>
-
-<para>Note that data address is not one of the fields for
-<computeroutput>idCC</computeroutput>. This is because for many
-memory-referencing instructions the data address can change each
-time it's executed (eg. if it uses register-offset addressing).
-We have to give this item to the cache simulation in a different
-way (see Instrumentation section below). Some memory-referencing
-instructions do always reference the same address, but we don't
-try to treat them specialy in order to keep things simple.</para>
-
-<para>Also note that there is only room for recording info about
-one data cache access in an
-<computeroutput>idCC</computeroutput>. So what about
-instructions that do a read then a write, such as:</para>
-<programlisting><![CDATA[
-inc %(esi)]]></programlisting>
-
-<para>In a write-allocate cache, as simulated by Valgrind, the
-write cannot miss, since it immediately follows the read which
-will drag the block into the cache if it's not already there. So
-the write access isn't really interesting, and Valgrind doesn't
-record it. This means that Valgrind doesn't measure memory
-references, but rather memory references that could miss in the
-cache. This behaviour is the same as that used by the AMD Athlon
-hardware counters. It also has the benefit of simplifying the
-implementation -- instructions that read and write memory can be
-treated like instructions that read memory.</para>
-
-</sect1>
-
-
-<sect1 id="cg-tech-docs.ccstore" xreflabel="Storing cost-centres">
-<title>Storing cost-centres</title>
-
-<para>Cost centres are stored in a way that makes them very cheap
-to lookup, which is important since one is looked up for every
-original x86 instruction executed.</para>
-
-<para>Valgrind does JIT translations at the basic block level,
-and cost centres are also setup and stored at the basic block
-level. By doing things carefully, we store all the cost centres
-for a basic block in a contiguous array, and lookup comes almost
-for free.</para>
-
-<para>Consider this part of a basic block (for exposition
-purposes, pretend it's an entire basic block):</para>
-<programlisting><![CDATA[
-movl $0x0,%eax
-movl $0x99, -4(%ebp)]]></programlisting>
-
-<para>The translation to UCode looks like this:</para>
-<programlisting><![CDATA[
-MOVL $0x0, t20
-PUTL t20, %EAX
-INCEIPo $5
-
-LEA1L -4(t4), t14
-MOVL $0x99, t18
-STL t18, (t14)
-INCEIPo $7]]></programlisting>
-
-<para>The first step is to allocate the cost centres. This
-requires a preliminary pass to count how many x86 instructions
-were in the basic block, and their types (and thus sizes). UCode
-translations for single x86 instructions are delimited by the
-<computeroutput>INCEIPo</computeroutput> instruction, the
-argument of which gives the byte size of the instruction (note
-that lazy INCEIP updating is turned off to allow this).</para>
-
-<para>We can tell if an x86 instruction references memory by
-looking for <computeroutput>LDL</computeroutput> and
-<computeroutput>STL</computeroutput> UCode instructions, and thus
-what kind of cost centre is required. From this we can determine
-how many cost centres we need for the basic block, and their
-sizes. We can then allocate them in a single array.</para>
-
-<para>Consider the example code above. After the preliminary
-pass, we know we need two cost centres, one
-<computeroutput>iCC</computeroutput> and one
-<computeroutput>dCC</computeroutput>. So we allocate an array to
-store these which looks like this:</para>
-
-<programlisting><![CDATA[
-|(uninit)| tag (1 byte)
-|(uninit)| instr_size (1 bytes)
-|(uninit)| (padding) (2 bytes)
-|(uninit)| instr_addr (4 bytes)
-|(uninit)| I.a (8 bytes)
-|(uninit)| I.m1 (8 bytes)
-|(uninit)| I.m2 (8 bytes)
-
-|(uninit)| tag (1 byte)
-|(uninit)| instr_size (1 byte)
-|(uninit)| data_size (1 byte)
-|(uninit)| (padding) (1 byte)
-|(uninit)| instr_addr (4 bytes)
-|(uninit)| I.a (8 bytes)
-|(uninit)| I.m1 (8 bytes)
-|(uninit)| I.m2 (8 bytes)
-|(uninit)| D.a (8 bytes)
-|(uninit)| D.m1 (8 bytes)
-|(uninit)| D.m2 (8 bytes)]]></programlisting>
-
-<para>(We can see now why we need tags to distinguish between the
-two types of cost centres.)</para>
-
-<para>We also record the size of the array. We look up the debug
-info of the first instruction in the basic block, and then stick
-the array into a table indexed by filename and function name.
-This makes it easy to dump the information quickly to file at the
-end.</para>
-
-</sect1>
-
-
-<sect1 id="cg-tech-docs.instrum" xreflabel="Instrumentation">
-<title>Instrumentation</title>
-
-<para>The instrumentation pass has two main jobs:</para>
-
-<orderedlist>
- <listitem>
- <para>Fill in the gaps in the allocated cost centres.</para>
- </listitem>
- <listitem>
- <para>Add UCode to call the cache simulator for each
- instruction.</para>
- </listitem>
-</orderedlist>
-
-<para>The instrumentation pass steps through the UCode and the
-cost centres in tandem. As each original x86 instruction's UCode
-is processed, the appropriate gaps in the instructions cost
-centre are filled in, for example:</para>
-
-<programlisting><![CDATA[
-|INSTR_CC| tag (1 byte)
-|5 | instr_size (1 bytes)
-|(uninit)| (padding) (2 bytes)
-|i_addr1 | instr_addr (4 bytes)
-|0 | I.a (8 bytes)
-|0 | I.m1 (8 bytes)
-|0 | I.m2 (8 bytes)
-
-|WRITE_CC| tag (1 byte)
-|7 | instr_size (1 byte)
-|4 | data_size (1 byte)
-|(uninit)| (padding) (1 byte)
-|i_addr2 | instr_addr (4 bytes)
-|0 | I.a (8 bytes)
-|0 | I.m1 (8 bytes)
-|0 | I.m2 (8 bytes)
-|0 | D.a (8 bytes)
-|0 | D.m1 (8 bytes)
-|0 | D.m2 (8 bytes)]]></programlisting>
-
-<para>(Note that this step is not performed if a basic block is
-re-translated; see <xref linkend="cg-tech-docs.retranslations"/> for
-more information.)</para>
-
-<para>GCC inserts padding before the
-<computeroutput>instr_size</computeroutput> field so that it is
-word aligned.</para>
-
-<para>The instrumentation added to call the cache simulation
-function looks like this (instrumentation is indented to
-distinguish it from the original UCode):</para>
-
-<programlisting><![CDATA[
-MOVL $0x0, t20
-PUTL t20, %EAX
- PUSHL %eax
- PUSHL %ecx
- PUSHL %edx
- MOVL $0x4091F8A4, t46 # address of 1st CC
- PUSHL t46
- CALLMo $0x12 # second cachesim function
- CLEARo $0x4
- POPL %edx
- POPL %ecx
- POPL %eax
-INCEIPo $5
-
-LEA1L -4(t4), t14
-MOVL $0x99, t18
- MOVL t14, t42
-STL t18, (t14)
- PUSHL %eax
- PUSHL %ecx
- PUSHL %edx
- PUSHL t42
- MOVL $0x4091F8C4, t44 # address of 2nd CC
- PUSHL t44
- CALLMo $0x13 # second cachesim function
- CLEARo $0x8
- POPL %edx
- POPL %ecx
- POPL %eax
-INCEIPo $7]]></programlisting>
-
-<para>Consider the first instruction's UCode. Each call is
-surrounded by three <computeroutput>PUSHL</computeroutput> and
-<computeroutput>POPL</computeroutput> instructions to save and
-restore the caller-save registers. Then the address of the
-instruction's cost centre is pushed onto the stack, to be the
-first argument to the cache simulation function. The address is
-known at this point because we are doing a simultaneous pass
-through the cost centre array. This means the cost centre lookup
-for each instruction is almost free (just the cost of pushing an
-argument for a function call). Then the call to the cache
-simulation function for non-memory-reference instructions is made
-(note that the <computeroutput>CALLMo</computeroutput>
-UInstruction takes an offset into a table of predefined
-functions; it is not an absolute address), and the single
-argument is <computeroutput>CLEAR</computeroutput>ed from the
-stack.</para>
-
-<para>The second instruction's UCode is similar. The only
-difference is that, as mentioned before, we have to pass the
-address of the data item referenced to the cache simulation
-function too. This explains the <computeroutput>MOVL t14,
-t42</computeroutput> and <computeroutput>PUSHL
-t42</computeroutput> UInstructions. (Note that the seemingly
-redundant <computeroutput>MOV</computeroutput>ing will probably
-be optimised away during register allocation.)</para>
-
-<para>Note that instead of storing unchanging information about
-each instruction (instruction size, data size, etc) in its cost
-centre, we could have passed in these arguments to the simulation
-function. But this would slow the calls down (two or three extra
-arguments pushed onto the stack). Also it would bloat the UCode
-instrumentation by amounts similar to the space required for them
-in the cost centre; bloated UCode would also fill the translation
-cache more quickly, requiring more translations for large
-programs and slowing them down more.</para>
-
-</sect1>
-
-
-<sect1 id="cg-tech-docs.retranslations"
- xreflabel="Handling basic block retranslations">
-<title>Handling basic block retranslations</title>
-
-<para>The above description ignores one complication. Valgrind
-has a limited size cache for basic block translations; if it
-fills up, old translations are discarded. If a discarded basic
-block is executed again, it must be re-translated.</para>
-
-<para>However, we can't use this approach for profiling -- we
-can't throw away cost centres for instructions in the middle of
-execution! So when a basic block is translated, we first look
-for its cost centre array in the hash table. If there is no cost
-centre array, it must be the first translation, so we proceed as
-described above. But if there is a cost centre array already, it
-must be a retranslation. In this case, we skip the cost centre
-allocation and initialisation steps, but still do the UCode
-instrumentation step.</para>
-
-</sect1>
-
-
-
-<sect1 id="cg-tech-docs.cachesim" xreflabel="The cache simulation">
-<title>The cache simulation</title>
-
-<para>The cache simulation is fairly straightforward. It just
-tracks which memory blocks are in the cache at the moment (it
-doesn't track the contents, since that is irrelevant).</para>
-
-<para>The interface to the simulation is quite clean. The
-functions called from the UCode contain calls to the simulation
-functions in the files
-<filename>vg_cachesim_{I1,D1,L2}.c</filename>; these calls are
-inlined so that only one function call is done per simulated x86
-instruction. The file <filename>vg_cachesim.c</filename> simply
-<computeroutput>#include</computeroutput>s the three files
-containing the simulation, which makes plugging in new cache
-simulations is very easy -- you just replace the three files and
-recompile.</para>
-
-</sect1>
-
-
-<sect1 id="cg-tech-docs.output" xreflabel="Output">
-<title>Output</title>
-
-<para>Output is fairly straightforward, basically printing the
-cost centre for every instruction, grouped by files and
-functions. Total counts (eg. total cache accesses, total L1
-misses) are calculated when traversing this structure rather than
-during execution, to save time; the cache simulation functions
-are called so often that even one or two extra adds can make a
-sizeable difference.</para>
-
-<para>Input file has the following format:</para>
-<programlisting><![CDATA[
-file ::= desc_line* cmd_line events_line data_line+ summary_line
-desc_line ::= "desc:" ws? non_nl_string
-cmd_line ::= "cmd:" ws? cmd
-events_line ::= "events:" ws? (event ws)+
-data_line ::= file_line | fn_line | count_line
-file_line ::= ("fl=" | "fi=" | "fe=") filename
-fn_line ::= "fn=" fn_name
-count_line ::= line_num ws? (count ws)+
-summary_line ::= "summary:" ws? (count ws)+
-count ::= num | "."]]></programlisting>
-
-<para>Where:</para>
-<itemizedlist>
- <listitem>
- <para><computeroutput>non_nl_string</computeroutput> is any
- string not containing a newline.</para>
- </listitem>
- <listitem>
- <para><computeroutput>cmd</computeroutput> is a command line
- invocation.</para>
- </listitem>
- <listitem>
- <para><computeroutput>filename</computeroutput> and
- <computeroutput>fn_name</computeroutput> can be anything.</para>
- </listitem>
- <listitem>
- <para><computeroutput>num</computeroutput> and
- <computeroutput>line_num</computeroutput> are decimal
- numbers.</para>
- </listitem>
- <listitem>
- <para><computeroutput>ws</computeroutput> is whitespace.</para>
- </listitem>
- <listitem>
- <para><computeroutput>nl</computeroutput> is a newline.</para>
- </listitem>
-
-</itemizedlist>
-
-<para>The contents of the "desc:" lines is printed out at the top
-of the summary. This is a generic way of providing simulation
-specific information, eg. for giving the cache configuration for
-cache simulation.</para>
-
-<para>Counts can be "." to represent "N/A", eg. the number of
-write misses for an instruction that doesn't write to
-memory.</para>
-
-<para>The number of counts in each
-<computeroutput>line</computeroutput> and the
-<computeroutput>summary_line</computeroutput> should not exceed
-the number of events in the
-<computeroutput>event_line</computeroutput>. If the number in
-each <computeroutput>line</computeroutput> is less, cg_annotate
-treats those missing as though they were a "." entry.</para>
-
-<para>A <computeroutput>file_line</computeroutput> changes the
-current file name. A <computeroutput>fn_line</computeroutput>
-changes the current function name. A
-<computeroutput>count_line</computeroutput> contains counts that
-pertain to the current filename/fn_name. A "fn="
-<computeroutput>file_line</computeroutput> and a
-<computeroutput>fn_line</computeroutput> must appear before any
-<computeroutput>count_line</computeroutput>s to give the context
-of the first <computeroutput>count_line</computeroutput>s.</para>
-
-<para>Each <computeroutput>file_line</computeroutput> should be
-immediately followed by a
-<computeroutput>fn_line</computeroutput>. "fi="
-<computeroutput>file_lines</computeroutput> are used to switch
-filenames for inlined functions; "fe="
-<computeroutput>file_lines</computeroutput> are similar, but are
-put at the end of a basic block in which the file name hasn't
-been switched back to the original file name. (fi and fe lines
-behave the same, they are only distinguished to help
-debugging.)</para>
-
-</sect1>
-
-
-
-<sect1 id="cg-tech-docs.summary"
- xreflabel="Summary of performance features">
-<title>Summary of performance features</title>
-
-<para>Quite a lot of work has gone into making the profiling as
-fast as possible. This is a summary of the important
-features:</para>
-
-<itemizedlist>
-
- <listitem>
- <para>The basic block-level cost centre storage allows almost
- free cost centre lookup.</para>
- </listitem>
-
- <listitem>
- <para>Only one function call is made per instruction
- simulated; even this accounts for a sizeable percentage of
- execution time, but it seems unavoidable if we want
- flexibility in the cache simulator.</para>
- </listitem>
-
- <listitem>
- <para>Unchanging information about an instruction is stored
- in its cost centre, avoiding unnecessary argument pushing,
- and minimising UCode instrumentation bloat.</para>
- </listitem>
-
- <listitem>
- <para>Summary counts are calculated at the end, rather than
- during execution.</para>
- </listitem>
-
- <listitem>
- <para>The <computeroutput>cachegrind.out</computeroutput>
- output files can contain huge amounts of information; file
- format was carefully chosen to minimise file sizes.</para>
- </listitem>
-
-</itemizedlist>
-
-</sect1>
-
-
-
-<sect1 id="cg-tech-docs.annotate" xreflabel="Annotation">
-<title>Annotation</title>
-
-<para>Annotation is done by cg_annotate. It is a fairly
-straightforward Perl script that slurps up all the cost centres,
-and then runs through all the chosen source files, printing out
-cost centres with them. It too has been carefully optimised.</para>
-
-</sect1>
-
-
-
-<sect1 id="cg-tech-docs.extensions" xreflabel="Similar work, extensions">
-<title>Similar work, extensions</title>
-
-<para>It would be relatively straightforward to do other
-simulations and obtain line-by-line information about interesting
-events. A good example would be branch prediction -- all
-branches could be instrumented to interact with a branch
-prediction simulator, using very similar techniques to those
-described above.</para>
-
-<para>In particular, cg_annotate would not need to change -- the
-file format is such that it is not specific to the cache
-simulation, but could be used for any kind of line-by-line
-information. The only part of cg_annotate that is specific to
-the cache simulation is the name of the input file
-(<computeroutput>cachegrind.out</computeroutput>), although it
-would be very simple to add an option to control this.</para>
-
-</sect1>
-
-</chapter>