Also, since one instruction cache read is performed per instruction executed,
you can find out how many instructions are executed per line, which can be
-useful for optimisation and test coverage.<p>
+useful for traditional profiling and test coverage.<p>
Please note that this is an experimental feature. Any feedback, bug-fixes,
suggestions, etc, welcome.
The steps are described in detail in the following sections.<p>
-<h3>7.3 Cache simulation specifics</h3>
+<h3>7.2 Cache simulation specifics</h3>
Cachegrind uses a simulation for a machine with a split L1 cache and a unified
L2 cache. This configuration is used for all (modern) x86-based machines we
interested to hear from anyone who does.
<a name="profile"></a>
-<h3>7.4 Profiling programs</h3>
+<h3>7.3 Profiling programs</h3>
Cache profiling is enabled by using the <code>--cachesim=yes</code>
option to the <code>valgrind</code> shell script. Alternatively, it
Combined instruction and data figures for the L2 cache follow that.<p>
-<h3>7.5 Output file</h3>
+<h3>7.4 Output file</h3>
As well as printing summary information, Cachegrind also writes
line-by-line cache profiling information to a file named
</ul>
<a name="profile"></a>
-<h3>7.6 Cachegrind options</h3>
+<h3>7.5 Cachegrind options</h3>
Cachegrind accepts all the options that Valgrind does, although some of them
(ones related to memory checking) don't do anything when cache profiling.<p>
The interesting cache-simulation specific options are:
- <li><code>--I1=<size>,<associativity>,<line_size></code><p>
- <code>--D1=<size>,<associativity>,<line_size></code><p>
+ <li><code>--I1=<size>,<associativity>,<line_size></code><br>
+ <code>--D1=<size>,<associativity>,<line_size></code><br>
<code>--L2=<size>,<associativity>,<line_size></code><p>
- [default: uses CPUID for cache configuration]<p>
+ [default: uses CPUID for automagic cache configuration]<p>
Manually specifies the I1/D1/L2 cache configuration, where
<code>size</code> and <code>line_size</code> are measured in bytes. The
- three items must be comma-separated, but with no space, eg:
+ three items must be comma-separated, but with no spaces, eg:
<blockquote><code>cachegrind --I1=65535,2,64</code></blockquote>
- You can specify one, two or three of the caches. Any level not manually
- specified will be simulated using the configuration found in the normal
- way (via the CPUID instruction, or failing that, via defaults).
+ You can specify one, two or three of the I1/D1/L2 caches. Any level not
+ manually specified will be simulated using the configuration found in the
+ normal way (via the CPUID instruction, or failing that, via defaults).
</ul>
-
-
+
<a name="annotate"></a>
-<h3>7.7 Annotating C/C++ programs</h3>
+<h3>7.6 Annotating C/C++ programs</h3>
Before using <code>vg_annotate</code>, it is worth widening your
window to be at least 120-characters wide if possible, as the output
auto-annotation can produce a lot of output if your program is large!
-<h3>7.8 Annotating assembler programs</h3>
+<h3>7.7 Annotating assembler programs</h3>
Valgrind can annotate assembler programs too, or annotate the
assembler generated for your C program. Sometimes this is useful for
programs.
-<h3>7.9 <code>vg_annotate</code> options</h3>
+<h3>7.8 <code>vg_annotate</code> options</h3>
<ul>
<li><code>-h, --help</code></li><p>
<li><code>-v, --version</code><p>
</ul>
-<h3>7.10 Warnings</h3>
+<h3>7.9 Warnings</h3>
There are a couple of situations in which vg_annotate issues warnings.
<ul>
</ul>
-<h3>7.11 Things to watch out for</h3>
+<h3>7.10 Things to watch out for</h3>
Some odd things that can occur during annotation:
<ul>
How can the third instruction be executed twice when the others are
executed only once? As it turns out, it isn't. Here's a dump of the
- executable, from objdump:
+ executable, using <code>objdump -d</code>:
<pre>
8048f25: 8d 45 f4 lea 0xfffffff4(%ebp),%eax
<li>Files with more than 65,535 lines cause difficulties for the stabs debug
info reader. This is because the line number in the <code>struct
nlist</code> defined in <code>a.out.h</code> under Linux is only a 16-bit
- number. Valgrind can handle some files with more than 65,535 lines
+ value. Valgrind can handle some files with more than 65,535 lines
correctly by making some guesses to identify line number overflows. But
some cases are beyond it, in which case you'll get a warning message
explaining that annotations for the file might be incorrect.<p>
please let us know.<p>
-<h3>7.12 Accuracy</h3>
+<h3>7.11 Accuracy</h3>
Valgrind's cache profiling has a number of shortcomings:
<ul>
ways to the standard <code>malloc()</code>, which could warp the results.
</li><p>
+ <li>Valgrind's custom threads implementation will schedule threads
+ differently to the standard one. This too could warp the results for
+ threaded programs.
+ </li><p>
+
<li>The instructions <code>bts</code>, <code>btr</code> and <code>btc</code>
will incorrectly be counted as doing a data read if both the arguments
are registers, eg:
hopefully they should be close enough to be useful.<p>
-<h3>7.13 Todo</h3>
+<h3>7.12 Todo</h3>
<ul>
<li>Program start-up/shut-down calls a lot of functions that aren't
interesting and just complicate the output. Would be nice to exclude