From: Nicholas Nethercote
Date: Fri, 3 May 2002 17:51:10 +0000 (+0000)
Subject: Added section to tech docs on how cachegrind works, including the
X-Git-Tag: svn/VALGRIND_1_0_3~254
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=ece0587a118fe439a83b56ca078f97ef03a3cca0;p=thirdparty%2Fvalgrind.git

Added section to tech docs on how cachegrind works, including the
cachegrind.out file format.  Tiny change in user manual.

git-svn-id: svn://svn.valgrind.org/valgrind/trunk@198
---

diff --git a/cachegrind/docs/manual.html b/cachegrind/docs/manual.html
index 5644872d00..4b6b773915 100644
--- a/cachegrind/docs/manual.html
+++ b/cachegrind/docs/manual.html
@@ -1929,7 +1929,11 @@ particular, it records:
 On a modern x86 machine, an L1 miss will typically cost around 10 cycles, and
 an L2 miss can cost as much as 200 cycles.  Detailed cache profiling can be
-very useful for improving the performance of your program.
+very useful for improving the performance of your program.

+
+Also, since one instruction cache read is performed per instruction executed,
+you can find out how many instructions are executed per line, which can be
+useful for optimisation and test coverage.

 Please note that this is an experimental feature.  Any feedback, bug-fixes,
 suggestions, etc, welcome.

diff --git a/cachegrind/docs/techdocs.html b/cachegrind/docs/techdocs.html
index aea95c9bbd..5bfda47ee6 100644
--- a/cachegrind/docs/techdocs.html
+++ b/cachegrind/docs/techdocs.html
@@ -2108,5 +2108,415 @@ Valgrind into an even-more-useful tool.




Cache profiling

Valgrind is a very nice platform for doing cache profiling and other kinds of
simulation, because it converts horrible x86 instructions into nice clean
RISC-like UCode.  For example, for cache profiling we are interested in
instructions that read and write memory; in UCode there are only four
instructions that do this: LOAD, STORE, FPU_R and FPU_W.  By contrast,
because of the x86 addressing modes, almost every x86 instruction can read
or write memory.

Most of the cache profiling machinery is in the file vg_cachesim.c.

These notes are a somewhat haphazard guide to how Valgrind's cache profiling
works.


Cost centres

Valgrind gathers cache profiling information about every instruction
executed, individually.  Each instruction has a cost centre associated with
it.  There are two kinds of cost centre: one for instructions that don't
reference memory (iCC), and one for instructions that do (idCC):

typedef struct _CC {
   ULong a;
   ULong m1;
   ULong m2;
} CC;

typedef struct _iCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
} iCC;

typedef struct _idCC {
   /* word 1 */
   UChar tag;
   UChar instr_size;
   UChar data_size;

   /* words 2+ */
   Addr instr_addr;
   CC I;
   CC D;
} idCC;

Each CC has three fields -- a, m1 and m2 -- for recording references,
level 1 misses and level 2 misses.  Each is a 64-bit ULong, because the
numbers can get very large, ie. greater than the 4.2 billion allowed by a
32-bit unsigned int.
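
To make the counting concrete, here is a minimal sketch of how one simulated
access might bump these counters.  It is illustrative only -- the helper name
and the hit flags are invented, not Valgrind's actual code:

#include <stdio.h>

typedef unsigned long long ULong;
typedef struct _CC { ULong a, m1, m2; } CC;

/* Record one access: every access counts; an L1 miss bumps m1, and an
   access that also misses in L2 bumps m2 as well. */
static void add_access(CC* cc, int l1_hit, int l2_hit)
{
   cc->a++;
   if (!l1_hit) {
      cc->m1++;
      if (!l2_hit)
         cc->m2++;
   }
}

int main(void)
{
   CC cc = { 0, 0, 0 };
   add_access(&cc, 0, 1);   /* L1 miss, L2 hit */
   add_access(&cc, 1, 1);   /* L1 hit          */
   printf("a=%llu m1=%llu m2=%llu\n", cc.a, cc.m1, cc.m2);  /* a=2 m1=1 m2=0 */
   return 0;
}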

An iCC has one CC, for instruction cache accesses.  An idCC has two: one for
instruction cache accesses, and one for data cache accesses.

The iCC and idCC structs also store unchanging information about the
instruction: its address (instr_addr), its size in bytes (instr_size) and,
in an idCC, the size of the data accessed (data_size).

Note that the data address is not one of the fields for idCC.  This is
because for many memory-referencing instructions the data address can change
each time the instruction is executed (eg. if it uses register-offset
addressing).  We have to give this item to the cache simulation in a
different way (see the Instrumentation section below).  Some
memory-referencing instructions do always reference the same address, but we
don't try to treat them specially, in order to keep things simple.

Also note that there is only room for recording info about one data cache
access in an idCC.  So what about instructions that do a read then a write,
such as:

    inc (%esi)
In a write-allocate cache, as simulated by Valgrind, the write cannot miss,
since it immediately follows the read, which will drag the block into the
cache if it's not already there.  So the write access isn't really
interesting, and Valgrind doesn't record it.  This means that Valgrind
doesn't measure memory references, but rather memory references that could
miss in the cache.  This behaviour is the same as that of the AMD Athlon
hardware counters.  It also has the benefit of simplifying the
implementation -- instructions that read and write memory can be treated
like instructions that read memory.


Storing cost-centres

Cost centres are stored in a way that makes them very cheap to look up,
which is important since one is looked up for every original x86 instruction
executed.

Valgrind does JIT translations at the basic-block level, and cost centres
are also set up and stored at the basic-block level.  By doing things
carefully, we store all the cost centres for a basic block in a contiguous
array, and lookup comes almost for free.

Consider this part of a basic block (for exposition purposes, pretend it's
an entire basic block):

movl $0x0,%eax
movl $0x99, -4(%ebp)

The translation to UCode looks like this:

MOVL      $0x0, t20
PUTL      t20, %EAX
INCEIPo   $5

LEA1L     -4(t4), t14
MOVL      $0x99, t18
STL       t18, (t14)
INCEIPo   $7

The first step is to allocate the cost centres.  This requires a preliminary
pass to count how many x86 instructions are in the basic block, and their
types (and thus sizes).  UCode translations of single x86 instructions are
delimited by the INCEIPo instruction, whose argument gives the byte size of
the instruction (note that lazy INCEIP updating is turned off to allow
this).

We can tell whether an x86 instruction references memory by looking for LDL
and STL UCode instructions, and thus what kind of cost centre it requires.
From this we can determine how many cost centres we need for the basic
block, and their sizes.  We can then allocate them in a single array.
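
A sketch of that preliminary sizing pass might look like the following.  The
UCode types here are simplified stand-ins for the real ones in vg_cachesim.c;
only the counting logic is the point:

typedef enum { UI_OTHER, UI_LD, UI_ST, UI_INCEIP } UOpcode;
typedef struct { UOpcode opcode; } UInstr;

/* Count the bytes needed for a basic block's cost centre array: one iCC
   per non-memory x86 instruction, one idCC per memory-referencing one.
   Each INCEIPo marks the end of one x86 instruction's UCode. */
static int count_CC_bytes(const UInstr* u, int n_uinstrs,
                          int sizeof_iCC, int sizeof_idCC)
{
   int bytes = 0, refs_mem = 0;
   for (int i = 0; i < n_uinstrs; i++) {
      if (u[i].opcode == UI_LD || u[i].opcode == UI_ST)
         refs_mem = 1;
      else if (u[i].opcode == UI_INCEIP) {
         bytes += (refs_mem ? sizeof_idCC : sizeof_iCC);
         refs_mem = 0;
      }
   }
   return bytes;
}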

Consider the example code above.  After the preliminary pass, we know we
need two cost centres, one iCC and one idCC.  So we allocate an array to
store these, which looks like this:

|(uninit)|      tag         (1 byte)
|(uninit)|      instr_size  (1 byte)
|(uninit)|      (padding)   (2 bytes)
|(uninit)|      instr_addr  (4 bytes)
|(uninit)|      I.a         (8 bytes)
|(uninit)|      I.m1        (8 bytes)
|(uninit)|      I.m2        (8 bytes)

|(uninit)|      tag         (1 byte)
|(uninit)|      instr_size  (1 byte)
|(uninit)|      data_size   (1 byte)
|(uninit)|      (padding)   (1 byte)
|(uninit)|      instr_addr  (4 bytes)
|(uninit)|      I.a         (8 bytes)
|(uninit)|      I.m1        (8 bytes)
|(uninit)|      I.m2        (8 bytes)
|(uninit)|      D.a         (8 bytes)
|(uninit)|      D.m1        (8 bytes)
|(uninit)|      D.m2        (8 bytes)

(We can see now why we need tags to distinguish between the two types of
cost centre.)

We also record the size of the array.  We look up the debug info of the
first instruction in the basic block, and then stick the array into a table
indexed by filename and function name.  This makes it easy to dump the
information quickly to file at the end.
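
For illustration, the table key might look something like this; the struct
and hash function are hypothetical shapes, not the real code in
vg_cachesim.c:

/* Key under which a basic block's cost centre array is filed. */
typedef struct {
   const char* filename;
   const char* fn_name;
} BBKey;

/* A simple string hash over both key components. */
static unsigned hash_key(const BBKey* k, unsigned n_buckets)
{
   unsigned h = 0;
   const char* s;
   for (s = k->filename; *s != '\0'; s++) h = h * 31 + (unsigned char)*s;
   for (s = k->fn_name;  *s != '\0'; s++) h = h * 31 + (unsigned char)*s;
   return h % n_buckets;
}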


Instrumentation

The instrumentation pass has two main jobs:

  1. Fill in the gaps in the allocated cost centres.
  2. Add UCode to call the cache simulator for each instruction.

The instrumentation pass steps through the UCode and the cost centres in
tandem.  As each original x86 instruction's UCode is processed, the
appropriate gaps in that instruction's cost centre are filled in, for
example:

|INSTR_CC|      tag         (1 byte)
|5       |      instr_size  (1 byte)
|(uninit)|      (padding)   (2 bytes)
|i_addr1 |      instr_addr  (4 bytes)
|0       |      I.a         (8 bytes)
|0       |      I.m1        (8 bytes)
|0       |      I.m2        (8 bytes)

|WRITE_CC|      tag         (1 byte)
|7       |      instr_size  (1 byte)
|4       |      data_size   (1 byte)
|(uninit)|      (padding)   (1 byte)
|i_addr2 |      instr_addr  (4 bytes)
|0       |      I.a         (8 bytes)
|0       |      I.m1        (8 bytes)
|0       |      I.m2        (8 bytes)
|0       |      D.a         (8 bytes)
|0       |      D.m1        (8 bytes)
|0       |      D.m2        (8 bytes)

(Note that this step is not performed if a basic block is retranslated; see
the section on handling basic block retranslations below for more
information.)

GCC inserts padding before the instr_addr field so that it is word-aligned.
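
Assuming 32-bit types as above, that layout claim can be sanity-checked with
offsetof; the typedefs below are stand-ins for Valgrind's own:

#include <stddef.h>
#include <assert.h>

typedef unsigned char      UChar;
typedef unsigned int       Addr;     /* 32 bits on x86 */
typedef unsigned long long ULong;
typedef struct _CC  { ULong a, m1, m2; } CC;
typedef struct _iCC { UChar tag; UChar instr_size; Addr instr_addr; CC I; } iCC;

int main(void)
{
   /* 2 padding bytes after instr_size put instr_addr on a word boundary. */
   assert(offsetof(iCC, instr_addr) == 4);
   return 0;
}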

The instrumentation added to call the cache simulation function looks like
this (instrumentation is indented to distinguish it from the original
UCode):

MOVL      $0x0, t20
PUTL      t20, %EAX
  PUSHL     %eax
  PUSHL     %ecx
  PUSHL     %edx
  MOVL      $0x4091F8A4, t46  # address of 1st CC
  PUSHL     t46
  CALLMo    $0x12             # first cachesim function
  CLEARo    $0x4
  POPL      %edx
  POPL      %ecx
  POPL      %eax
INCEIPo   $5

LEA1L     -4(t4), t14
MOVL      $0x99, t18
  MOVL      t14, t42
STL       t18, (t14)
  PUSHL     %eax
  PUSHL     %ecx
  PUSHL     %edx
  PUSHL     t42
  MOVL      $0x4091F8C4, t44  # address of 2nd CC
  PUSHL     t44
  CALLMo    $0x13             # second cachesim function
  CLEARo    $0x8
  POPL      %edx
  POPL      %ecx
  POPL      %eax
INCEIPo   $7

Consider the first instruction's UCode.  Each call is surrounded by three
PUSHL and POPL instructions to save and restore the caller-save registers.
Then the address of the instruction's cost centre is pushed onto the stack,
to be the first argument to the cache simulation function.  The address is
known at this point because we are doing a simultaneous pass through the
cost centre array.  This means the cost centre lookup for each instruction
is almost free (just the cost of pushing an argument for a function call).
Then the call to the cache simulation function for non-memory-reference
instructions is made (note that the CALLMo UInstruction takes an offset into
a table of predefined functions; it is not an absolute address), and the
single argument is CLEARed from the stack.

The second instruction's UCode is similar.  The only difference is that, as
mentioned before, we have to pass the address of the data item referenced to
the cache simulation function too.  This explains the MOVL t14, t42 and
PUSHL t42 UInstructions.  (Note that the seemingly redundant MOVing will
probably be optimised away during register allocation.)
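
Putting it together, the two helpers presumably have shapes along these
lines.  The names are illustrative; only the argument pattern is implied by
the UCode above:

typedef unsigned int Addr;
typedef struct _iCC  iCC;    /* as defined earlier */
typedef struct _idCC idCC;

/* Called via CALLMo $0x12: logs an instruction-cache access only. */
void cachesim_log_non_mem_instr(iCC* cc);

/* Called via CALLMo $0x13: logs an instruction-cache access plus a
   data-cache access at the address computed at run time (held in t42). */
void cachesim_log_mem_instr(idCC* cc, Addr data_addr);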

Note that instead of storing unchanging information about each instruction
(instruction size, data size, etc.) in its cost centre, we could have passed
these as arguments to the simulation function.  But that would slow the
calls down (two or three extra arguments pushed onto the stack), and would
bloat the UCode instrumentation by roughly the space those fields occupy in
the cost centre; bloated UCode would also fill the translation cache more
quickly, requiring more translations for large programs and slowing them
down further.


Handling basic block retranslations

The above description ignores one complication.  Valgrind has a limited-size
cache for basic block translations; if it fills up, old translations are
discarded.  If a discarded basic block is executed again, it must be
retranslated.

However, we can't take the same approach with the cost centres -- we can't
throw away counts for instructions in the middle of execution!  So when a
basic block is translated, we first look for its cost centre array in the
hash table.  If there is none, this must be the first translation, so we
proceed as described above.  But if a cost centre array already exists, this
must be a retranslation; in that case we skip the cost centre allocation and
initialisation steps, but still do the UCode instrumentation step.
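
In outline, with invented names for the lookup API, the translate-time logic
is:

typedef struct CCArray CCArray;   /* opaque: per-basic-block cost centres */
typedef struct BB      BB;        /* opaque: a basic block being translated */

extern CCArray* cc_table_lookup(unsigned long bb_addr);            /* assumed */
extern void     cc_table_insert(unsigned long bb_addr, CCArray*);  /* assumed */
extern CCArray* alloc_and_init_cc_array(BB* bb);                   /* assumed */
extern void     instrument_ucode(BB* bb, CCArray* ccs);            /* assumed */

void translate_bb(BB* bb, unsigned long bb_addr)
{
   CCArray* ccs = cc_table_lookup(bb_addr);
   if (ccs == NULL) {                       /* first translation */
      ccs = alloc_and_init_cc_array(bb);
      cc_table_insert(bb_addr, ccs);
   }
   /* On a retranslation the existing counts are kept untouched. */
   instrument_ucode(bb, ccs);               /* done in either case */
}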


The cache simulation

The cache simulation is fairly straightforward.  It just tracks which memory
blocks are in the cache at the moment (it doesn't track the contents, since
that is irrelevant).
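
For intuition, here is a minimal model in the same spirit: a direct-mapped
cache that records only which block is resident in each set.  The sizes are
arbitrary assumptions, and this is not the code in the vg_cachesim_*.c
files:

#include <string.h>

#define LINE_SZ   32u                  /* bytes per line (assumed)    */
#define N_LINES   256u                 /* 8KB direct-mapped (assumed) */

typedef struct {
   unsigned long tags[N_LINES];
   int           valid[N_LINES];
} Cache;

static void cache_init(Cache* c) { memset(c, 0, sizeof *c); }

/* Returns 1 on hit, 0 on miss; a miss installs the block. */
static int cache_ref(Cache* c, unsigned long addr)
{
   unsigned long block = addr / LINE_SZ;
   unsigned long set   = block % N_LINES;
   if (c->valid[set] && c->tags[set] == block)
      return 1;                        /* hit                         */
   c->valid[set] = 1;                  /* miss: drag the block in     */
   c->tags[set]  = block;
   return 0;
}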

The interface to the simulation is quite clean.  The functions called from
the UCode contain calls to the simulation functions in the files
vg_cachesim_{I1,D1,L2}.c; these calls are inlined so that only one function
call is made per simulated x86 instruction.  The file vg_cachesim.c simply
#includes the three files containing the simulation, which makes plugging in
new cache simulations very easy -- you just replace the three files and
recompile.


Output

Output is fairly straightforward: basically, the cost centre for every
instruction is printed, grouped by file and function.  Total counts
(eg. total cache accesses, total L1 misses) are calculated when traversing
this structure rather than during execution, to save time; the cache
simulation functions are called so often that even one or two extra adds can
make a sizeable difference.

The output file, cachegrind.out -- which is also vg_annotate's input file --
has the following format:

file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."

Where ws is whitespace, non_nl_string is any string not containing a
newline, cmd is the command line the profiled program was invoked with,
event is an event name, filename and fn_name are strings, and line_num and
num are decimal numbers.

The contents of the "desc:" lines are printed out at the top of the summary.
This is a generic way of providing simulation-specific information, eg. for
giving the cache configuration for the cache simulation.

Counts can be "." to represent "N/A", eg. the number of write misses for an
instruction that doesn't write to memory.

The number of counts in each count_line and in the summary_line should not
exceed the number of events in the events_line.  If the number in a line is
less, vg_annotate treats the missing counts as though they were "." entries.

A file_line changes the current filename.  A fn_line changes the current
function name.  A count_line contains counts that pertain to the current
filename/fn_name.  An "fl=" file_line and a fn_line must appear before any
count_lines, to give the context of the first count_lines.

Each file_line should be immediately followed by a fn_line.  "fi="
file_lines are used to switch filenames for inlined functions; "fe="
file_lines are similar, but are put at the end of a basic block in which the
filename hasn't been switched back to the original filename.  (fi and fe
lines behave the same; they are only distinguished to help debugging.)
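
To make the format concrete, a hypothetical cachegrind.out for a tiny
program might look like this (the cache parameters and event names are
illustrative):

desc: I1 cache:  65536 B, 64 B lines, 2-way associative
desc: D1 cache:  65536 B, 64 B lines, 2-way associative
desc: L2 cache:  262144 B, 64 B lines, 8-way associative
cmd: ./hello
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=hello.c
fn=main
5 1 1 1 . . . . . .
6 2 . . 1 . . 1 . .
summary: 3 1 1 1 0 0 1 0 0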


Summary of performance features

Quite a lot of work has gone into making the profiling as fast as possible.
This is a summary of the important features:

  - The cost centres for each basic block live in a single contiguous
    array, and each cost centre's address is baked into the
    instrumentation, so no lookup is needed while the program runs.
  - Unchanging information (instruction address and sizes) is stored in the
    cost centres rather than passed as extra call arguments.
  - The per-reference simulation functions are inlined, so only one
    function call is made per simulated x86 instruction.
  - Writes that immediately follow reads in the same instruction are not
    simulated separately.
  - Total counts are computed when the results are dumped, not during
    execution.

Annotation

Annotation is done by vg_annotate.  It is a fairly straightforward Perl
script that slurps up all the cost centres, and then runs through all the
chosen source files, printing out cost centres alongside the relevant lines.
It too has been carefully optimised.

Similar work, extensions

It would be relatively straightforward to do other simulations and obtain
line-by-line information about interesting events.  A good example would be
branch prediction -- all branches could be instrumented to interact with a
branch prediction simulator, using very similar techniques to those
described above.
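
For instance, each conditional branch's cost centre could be updated by
something like this two-bit saturating-counter predictor -- a standard
textbook scheme, not anything that exists in Valgrind:

#define BP_ENTRIES 1024u

/* 2-bit counters, 0..3; zero-initialised, so "not taken" at first. */
static unsigned char bp_table[BP_ENTRIES];

/* Returns 1 if the prediction was correct, 0 otherwise, and trains the
   counter towards the actual outcome. */
static int bp_predict_and_update(unsigned long instr_addr, int taken)
{
   unsigned idx  = (unsigned)(instr_addr % BP_ENTRIES);
   int predicted = (bp_table[idx] >= 2);       /* >= 2 means "taken" */
   if (taken  && bp_table[idx] < 3) bp_table[idx]++;
   if (!taken && bp_table[idx] > 0) bp_table[idx]--;
   return predicted == taken;
}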

In particular, vg_annotate would not need to change -- the file format is
not specific to the cache simulation, but could be used for any kind of
line-by-line information.  The only part of vg_annotate that is specific to
the cache simulation is the name of the input file (cachegrind.out),
although it would be very simple to add an option to control this.

diff --git a/coregrind/docs/manual.html b/coregrind/docs/manual.html
index 5644872d00..4b6b773915 100644
(identical to the cachegrind/docs/manual.html change above)

diff --git a/coregrind/docs/techdocs.html b/coregrind/docs/techdocs.html
index aea95c9bbd..5bfda47ee6 100644
(identical to the cachegrind/docs/techdocs.html change above)

diff --git a/docs/manual.html b/docs/manual.html
index 5644872d00..4b6b773915 100644
(identical to the cachegrind/docs/manual.html change above)

diff --git a/docs/techdocs.html b/docs/techdocs.html
index aea95c9bbd..5bfda47ee6 100644
(identical to the cachegrind/docs/techdocs.html change above)

diff --git a/memcheck/docs/manual.html b/memcheck/docs/manual.html
index 5644872d00..4b6b773915 100644
(identical to the cachegrind/docs/manual.html change above)

diff --git a/memcheck/docs/techdocs.html b/memcheck/docs/techdocs.html
index aea95c9bbd..5bfda47ee6 100644
(identical to the cachegrind/docs/techdocs.html change above)