From: Nicholas Nethercote
Date: Sat, 21 Oct 2006 22:25:56 +0000 (+0000)
Subject: Removed the file format description from cg_annotate.in, because it's in the
X-Git-Tag: svn/VALGRIND_3_2_2~64
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=5bd0ea9e5a7982a4678b5bcea4757d684ef69ca0;p=thirdparty%2Fvalgrind.git

Removed the file format description from cg_annotate.in, because it's in the
Cachegrind docs.

Removed the Cachegrind tech docs, because they're so out of date to be
useless. My PhD dissertation gives a much better description of how
Cachegrind works. (I mentioned this in the Cachegrind user manual.) The
only still-useful part of Cachegrind's tech docs, the output file format
description, I moved into the Cachegrind user manual.

MERGED FROM TRUNK

git-svn-id: svn://svn.valgrind.org/valgrind/branches/VALGRIND_3_2_BRANCH@6333
---

diff --git a/cachegrind/cg_annotate.in b/cachegrind/cg_annotate.in
index fe7a27ec71..811e5a8466 100644
--- a/cachegrind/cg_annotate.in
+++ b/cachegrind/cg_annotate.in
@@ -29,47 +29,8 @@
 #----------------------------------------------------------------------------
 # The file format is simple, basically printing the cost centre for every
-# source line, grouped by files and functions:
-#
-# file         ::= desc_line* cmd_line events_line data_line+ summary_line
-# desc_line    ::= "desc:" ws? non_nl_string
-# cmd_line     ::= "cmd:" ws? cmd
-# events_line  ::= "events:" ws? (event ws)+
-# data_line    ::= file_line | fn_line | count_line
-# file_line    ::= "fl=" filename
-# fn_line      ::= "fn=" fn_name
-# count_line   ::= line_num ws? (count ws)+
-# summary_line ::= "summary:" ws? (count ws)+
-# count        ::= num | "."
-#
-# where
-#   'non_nl_string' is any string not containing a newline.
-#   'cmd' is a string holding the command line of the profiled program.
-#   'filename' and 'fn_name' are strings.
-#   'num' and 'line_num' are decimal integers.
-#   'ws' is whitespace.
-#
-# The contents of the "desc:" lines are printed out at the top
-# of the summary. This is a generic way of providing simulation
-# specific information, eg. for giving the cache configuration for
-# cache simulation.
-#
-# More than one line of info can be presented for each file/fn/line number.
-# In such cases, the counts for the named events will be accumulated.
-#
-# Counts can be "." to represent zero. This makes the files easier to read.
-#
-# The number of counts in each 'line' and the 'summary_line' should not exceed
-# the number of events in the 'event_line'. If the number in each 'line' is
-# less, cg_annotate treats those missing as though they were a "." entry.
-#
-# A 'file_line' changes the current file name. A 'fn_line' changes the
-# current function name. A 'count_line' contains counts that pertain to the
-# current filename/fn_name. A 'file_line' and a 'fn_line' must appear
-# before any 'count_line's to give the context of the first 'count_line'.
-#
-# Each 'file_line' will normally be immediately followed by a 'fn_line'.
-# But it doesn't have to be.
+# source line, grouped by files and functions. The details are in
+# Cachegrind's manual.
 #----------------------------------------------------------------------------
 # Performance improvements record, using cachegrind.out for cacheprof, doing no
diff --git a/cachegrind/docs/cg-manual.xml b/cachegrind/docs/cg-manual.xml
index 2e49b147dc..1d2f123329 100644
--- a/cachegrind/docs/cg-manual.xml
+++ b/cachegrind/docs/cg-manual.xml
@@ -6,12 +6,6 @@
 Cachegrind: a cache profiler
-Detailed technical documentation on how Cachegrind works is
-available in . If you only want to know
-how to use it, this is the page you need to
-read.
-
-
 Cache profiling
@@ -1018,17 +1012,100 @@
 useful.
+
+
+
+Implementation details
+This section talks about details you don't need to know about in order to
+use Cachegrind, but may be of interest to some people.
-Todo
+How Cachegrind works
+The best reference for understanding how Cachegrind works is chapter 3 of
+"Dynamic Binary Analysis and Instrumentation", by Nicholas Nethercote. It
+is available on the publications page of the Valgrind website.
+
+
+Cachegrind output file format
+The file format is fairly straightforward, basically giving the
+cost centre for every line, grouped by files and
+functions. Total counts (eg. total cache accesses, total L1
+misses) are calculated when traversing this structure rather than
+during execution, to save time; the cache simulation functions
+are called so often that even one or two extra adds can make a
+sizeable difference.
+
+The file format:
+
+
+Where:
- Program start-up/shut-down calls a lot of functions
- that aren't interesting and just complicate the output.
- Would be nice to exclude these somehow.
+ non_nl_string is any
+ string not containing a newline.
+
+
+ cmd is a string holding the
+ command line of the profiled program.
+
+
+ filename and
+ fn_name are strings.
-
+
+
+ num and
+ line_num are decimal
+ numbers.
+
+
+ ws is whitespace.
+
+
+
+The contents of the "desc:" lines are printed out at the top
+of the summary. This is a generic way of providing simulation
+specific information, eg. for giving the cache configuration for
+cache simulation.
+
+More than one line of info can be presented for each file/fn/line number.
+In such cases, the counts for the named events will be accumulated.
+
+Counts can be "." to represent zero. This makes the files easier to
+read.
+
+The number of counts in each
+line and the
+summary_line should not exceed
+the number of events in the
+event_line. If the number in
+each line is less, cg_annotate
+treats those missing as though they were a "." entry.
+
+A file_line changes the
+current file name. A fn_line
+changes the current function name. A
+count_line contains counts that
+pertain to the current filename/fn_name. A "fn="
+file_line and a
+fn_line must appear before any
+count_lines to give the context
+of the first count_lines.
+
+Each file_line will normally be
+immediately followed by a fn_line. But it
+doesn't have to be.
+
diff --git a/cachegrind/docs/cg-tech-docs.xml b/cachegrind/docs/cg-tech-docs.xml
deleted file mode 100644
index bc49a7ae9a..0000000000
--- a/cachegrind/docs/cg-tech-docs.xml
+++ /dev/null
@@ -1,563 +0,0 @@
-
-
-
-
-
-How Cachegrind works
-
-
-Cache profiling
-
-[Note: this document is now very old, and a lot of its contents are out
-of date, and misleading.]
-
-Valgrind is a very nice platform for doing cache profiling
-and other kinds of simulation, because it converts horrible x86
-instructions into nice clean RISC-like UCode. For example, for
-cache profiling we are interested in instructions that read and
-write memory; in UCode there are only four instructions that do
-this: LOAD,
-STORE,
-FPU_R and
-FPU_W. By contrast, because of
-the x86 addressing modes, almost every instruction can read or
-write memory.
-
-Most of the cache profiling machinery is in the file
-vg_cachesim.c.
-
-These notes are a somewhat haphazard guide to how
-Valgrind's cache profiling works.
-
-
-
-
-
-Cost centres
-
-Valgrind gathers cache profiling about every instruction
-executed, individually. Each instruction has a cost
-centre associated with it. There are two kinds of cost
-centre: one for instructions that don't reference memory
-(iCC), and one for instructions
-that do (idCC):
-
-
-
-Each CC has three fields
-a,
-m1,
-m2 for recording references,
-level 1 misses and level 2 misses. Each of these is a 64-bit
-ULong -- the numbers can get
-very large, ie. greater than 4.2 billion allowed by a 32-bit
-unsigned int.
-
-A iCC has one
-CC for instruction cache
-accesses. A idCC has two, one
-for instruction cache accesses, and one for data cache
-accesses.
-
-The iCC and
-dCC structs also store
-unchanging information about the instruction:
-
-
- An instruction-type identification tag (explained
- below)
-
-
- Instruction size
-
-
- Data reference size
- (idCC only)
-
-
- Instruction address
-
-
-
-Note that data address is not one of the fields for
-idCC. This is because for many
-memory-referencing instructions the data address can change each
-time it's executed (eg. if it uses register-offset addressing).
-We have to give this item to the cache simulation in a different
-way (see Instrumentation section below). Some memory-referencing
-instructions do always reference the same address, but we don't
-try to treat them specialy in order to keep things simple.
-
-Also note that there is only room for recording info about
-one data cache access in an
-idCC. So what about
-instructions that do a read then a write, such as:
-
-
-In a write-allocate cache, as simulated by Valgrind, the
-write cannot miss, since it immediately follows the read which
-will drag the block into the cache if it's not already there. So
-the write access isn't really interesting, and Valgrind doesn't
-record it. This means that Valgrind doesn't measure memory
-references, but rather memory references that could miss in the
-cache. This behaviour is the same as that used by the AMD Athlon
-hardware counters. It also has the benefit of simplifying the
-implementation -- instructions that read and write memory can be
-treated like instructions that read memory.
-
-
-
-
-
-Storing cost-centres
-
-Cost centres are stored in a way that makes them very cheap
-to lookup, which is important since one is looked up for every
-original x86 instruction executed.
-
-Valgrind does JIT translations at the basic block level,
-and cost centres are also setup and stored at the basic block
-level. By doing things carefully, we store all the cost centres
-for a basic block in a contiguous array, and lookup comes almost
-for free.
-
-Consider this part of a basic block (for exposition
-purposes, pretend it's an entire basic block):
-
-
-The translation to UCode looks like this:
-
-
-The first step is to allocate the cost centres. This
-requires a preliminary pass to count how many x86 instructions
-were in the basic block, and their types (and thus sizes). UCode
-translations for single x86 instructions are delimited by the
-INCEIPo instruction, the
-argument of which gives the byte size of the instruction (note
-that lazy INCEIP updating is turned off to allow this).
-
-We can tell if an x86 instruction references memory by
-looking for LDL and
-STL UCode instructions, and thus
-what kind of cost centre is required. From this we can determine
-how many cost centres we need for the basic block, and their
-sizes. We can then allocate them in a single array.
-
-Consider the example code above. After the preliminary
-pass, we know we need two cost centres, one
-iCC and one
-dCC. So we allocate an array to
-store these which looks like this:
-
-
-
-(We can see now why we need tags to distinguish between the
-two types of cost centres.)
-
-We also record the size of the array. We look up the debug
-info of the first instruction in the basic block, and then stick
-the array into a table indexed by filename and function name.
-This makes it easy to dump the information quickly to file at the
-end.
-
-
-
-
-
-Instrumentation
-
-The instrumentation pass has two main jobs:
-
-
-
- Fill in the gaps in the allocated cost centres.
-
-
- Add UCode to call the cache simulator for each
- instruction.
-
-
-
-The instrumentation pass steps through the UCode and the
-cost centres in tandem. As each original x86 instruction's UCode
-is processed, the appropriate gaps in the instructions cost
-centre are filled in, for example:
-
-
-
-(Note that this step is not performed if a basic block is
-re-translated; see for
-more information.)
-
-GCC inserts padding before the
-instr_size field so that it is
-word aligned.
-
-The instrumentation added to call the cache simulation
-function looks like this (instrumentation is indented to
-distinguish it from the original UCode):
-
-
-
-Consider the first instruction's UCode. Each call is
-surrounded by three PUSHL and
-POPL instructions to save and
-restore the caller-save registers. Then the address of the
-instruction's cost centre is pushed onto the stack, to be the
-first argument to the cache simulation function. The address is
-known at this point because we are doing a simultaneous pass
-through the cost centre array. This means the cost centre lookup
-for each instruction is almost free (just the cost of pushing an
-argument for a function call). Then the call to the cache
-simulation function for non-memory-reference instructions is made
-(note that the CALLMo
-UInstruction takes an offset into a table of predefined
-functions; it is not an absolute address), and the single
-argument is CLEARed from the
-stack.
-
-The second instruction's UCode is similar. The only
-difference is that, as mentioned before, we have to pass the
-address of the data item referenced to the cache simulation
-function too. This explains the MOVL t14,
-t42 and PUSHL
-t42 UInstructions. (Note that the seemingly
-redundant MOVing will probably
-be optimised away during register allocation.)
-
-Note that instead of storing unchanging information about
-each instruction (instruction size, data size, etc) in its cost
-centre, we could have passed in these arguments to the simulation
-function. But this would slow the calls down (two or three extra
-arguments pushed onto the stack). Also it would bloat the UCode
-instrumentation by amounts similar to the space required for them
-in the cost centre; bloated UCode would also fill the translation
-cache more quickly, requiring more translations for large
-programs and slowing them down more.
-
-
-
-
-
-Handling basic block retranslations
-
-The above description ignores one complication. Valgrind
-has a limited size cache for basic block translations; if it
-fills up, old translations are discarded. If a discarded basic
-block is executed again, it must be re-translated.
-
-However, we can't use this approach for profiling -- we
-can't throw away cost centres for instructions in the middle of
-execution! So when a basic block is translated, we first look
-for its cost centre array in the hash table. If there is no cost
-centre array, it must be the first translation, so we proceed as
-described above. But if there is a cost centre array already, it
-must be a retranslation. In this case, we skip the cost centre
-allocation and initialisation steps, but still do the UCode
-instrumentation step.
-
-
-
-
-
-
-The cache simulation
-
-The cache simulation is fairly straightforward. It just
-tracks which memory blocks are in the cache at the moment (it
-doesn't track the contents, since that is irrelevant).
-
-The interface to the simulation is quite clean. The
-functions called from the UCode contain calls to the simulation
-functions in the files
-vg_cachesim_{I1,D1,L2}.c; these calls are
-inlined so that only one function call is done per simulated x86
-instruction. The file vg_cachesim.c simply
-#includes the three files
-containing the simulation, which makes plugging in new cache
-simulations is very easy -- you just replace the three files and
-recompile.
-
-
-
-
-
-Output
-
-Output is fairly straightforward, basically printing the
-cost centre for every instruction, grouped by files and
-functions. Total counts (eg. total cache accesses, total L1
-misses) are calculated when traversing this structure rather than
-during execution, to save time; the cache simulation functions
-are called so often that even one or two extra adds can make a
-sizeable difference.
-
-Input file has the following format:
-
-
-Where:
-
-
- non_nl_string is any
- string not containing a newline.
-
-
- cmd is a command line
- invocation.
-
-
- filename and
- fn_name can be anything.
-
-
- num and
- line_num are decimal
- numbers.
-
-
- ws is whitespace.
-
-
- nl is a newline.
-
-
-
-
-The contents of the "desc:" lines is printed out at the top
-of the summary. This is a generic way of providing simulation
-specific information, eg. for giving the cache configuration for
-cache simulation.
-
-Counts can be "." to represent "N/A", eg. the number of
-write misses for an instruction that doesn't write to
-memory.
-
-The number of counts in each
-line and the
-summary_line should not exceed
-the number of events in the
-event_line. If the number in
-each line is less, cg_annotate
-treats those missing as though they were a "." entry.
-
-A file_line changes the
-current file name. A fn_line
-changes the current function name. A
-count_line contains counts that
-pertain to the current filename/fn_name. A "fn="
-file_line and a
-fn_line must appear before any
-count_lines to give the context
-of the first count_lines.
-
-Each file_line should be
-immediately followed by a
-fn_line. "fi="
-file_lines are used to switch
-filenames for inlined functions; "fe="
-file_lines are similar, but are
-put at the end of a basic block in which the file name hasn't
-been switched back to the original file name. (fi and fe lines
-behave the same, they are only distinguished to help
-debugging.)
-
-
-
-
-
-
-Summary of performance features
-
-Quite a lot of work has gone into making the profiling as
-fast as possible. This is a summary of the important
-features:
-
-
-
-
- The basic block-level cost centre storage allows almost
- free cost centre lookup.
-
-
-
- Only one function call is made per instruction
- simulated; even this accounts for a sizeable percentage of
- execution time, but it seems unavoidable if we want
- flexibility in the cache simulator.
-
-
-
- Unchanging information about an instruction is stored
- in its cost centre, avoiding unnecessary argument pushing,
- and minimising UCode instrumentation bloat.
-
-
-
- Summary counts are calculated at the end, rather than
- during execution.
-
-
-
- The cachegrind.out
- output files can contain huge amounts of information; file
- format was carefully chosen to minimise file sizes.
-
-
-
-
-
-
-
-
-
-Annotation
-
-Annotation is done by cg_annotate. It is a fairly
-straightforward Perl script that slurps up all the cost centres,
-and then runs through all the chosen source files, printing out
-cost centres with them. It too has been carefully optimised.
-
-
-
-
-
-
-Similar work, extensions
-
-It would be relatively straightforward to do other
-simulations and obtain line-by-line information about interesting
-events. A good example would be branch prediction -- all
-branches could be instrumented to interact with a branch
-prediction simulator, using very similar techniques to those
-described above.
-
-In particular, cg_annotate would not need to change -- the
-file format is such that it is not specific to the cache
-simulation, but could be used for any kind of line-by-line
-information. The only part of cg_annotate that is specific to
-the cache simulation is the name of the input file
-(cachegrind.out), although it
-would be very simple to add an option to control this.
-
-
-
-
diff --git a/docs/xml/tech-docs.xml b/docs/xml/tech-docs.xml
index 8ce8dfdf0d..5bb7702852 100644
--- a/docs/xml/tech-docs.xml
+++ b/docs/xml/tech-docs.xml
@@ -19,8 +19,6 @@
-