From: Nicholas Nethercote
Date: Thu, 20 Apr 2023 21:20:11 +0000 (+1000)
Subject: Rewrite Cachegrind docs.
X-Git-Tag: VALGRIND_3_21_0~30
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=c2e62127ad8a9b71c4abf4b166ad545988490c32;p=thirdparty%2Fvalgrind.git

Rewrite Cachegrind docs.

For all the changes I've made recently. And also various other changes that
occurred over the past 20 years that didn't previously make it into the docs.

Also, this change de-emphasises the cache and branch simulation aspect, because
they're no longer that useful. Instead it emphasises the precision and
reproducibility of instruction count profiling.
---
diff --git a/cachegrind/docs/cg-manual.xml b/cachegrind/docs/cg-manual.xml
index 92fe086824..35d6a412e3 100644
--- a/cachegrind/docs/cg-manual.xml
+++ b/cachegrind/docs/cg-manual.xml
@@ -5,167 +5,117 @@
-Cachegrind: a cache and branch-prediction profiler
+Cachegrind: a high-precision tracing profiler
-To use this tool, you must specify
- on the
-Valgrind command line.
+
+To use this tool, specify on the Valgrind
+command line.
+
 Overview
-Cachegrind simulates how your program interacts with a machine's cache
-hierarchy and (optionally) branch predictor. It simulates a machine with
-independent first-level instruction and data caches (I1 and D1), backed by a
-unified second-level cache (L2). This exactly matches the configuration of
-many modern machines.
-
-However, some modern machines have three or four levels of cache. For these
-machines (in the cases where Cachegrind can auto-detect the cache
-configuration) Cachegrind simulates the first-level and last-level caches.
-The reason for this choice is that the last-level cache has the most influence on
-runtime, as it masks accesses to main memory. Furthermore, the L1 caches
-often have low associativity, so simulating them can detect cases where the
-code interacts badly with this cache (eg. traversing a matrix column-wise
-with the row length being a power of 2).
- -Therefore, Cachegrind always refers to the I1, D1 and LL (last-level) -caches. - -Cachegrind gathers the following statistics (abbreviations used for each statistic -is given in parentheses): +Cachegrind is a high-precision tracing profiler. It runs slowly, but collects +precise and reproducible profiling data. It can merge and diff data from +different runs. To expand on these characteristics: + + - I cache reads (Ir, - which equals the number of instructions executed), - I1 cache read misses (I1mr) and - LL cache instruction read misses (ILmr). - - - - D cache reads (Dr, which - equals the number of memory reads), - D1 cache read misses (D1mr), and - LL cache data read misses (DLmr). - - - - D cache writes (Dw, which equals - the number of memory writes), - D1 cache write misses (D1mw), and - LL cache data write misses (DLmw). - - - - Conditional branches executed (Bc) and - conditional branches mispredicted (Bcm). + + Precise. Cachegrind measures the exact number of + instructions executed by your program, not an approximation. Furthermore, + it presents the gathered data at the file, function, and line level. This + is different to many other profilers that measure approximate execution + time, using sampling, and only at the function level. + - Indirect branches executed (Bi) and - indirect branches mispredicted (Bim). + + Reproducible. In general, execution time is a better + metric than instruction counts because it's what users perceive. However, + execution time often has high variability. When running the exact same + program on the exact same input multiple times, execution time might vary + by several percent. Furthermore, small changes in a program can change its + memory layout and have even larger effects on runtime. In contrast, + instruction counts are highly reproducible; for some programs they are + perfectly reproducible. This means the effects of small changes in a + program can be measured with high precision. 
-Note that D1 total accesses is given by -D1mr + -D1mw, and that LL total -accesses is given by ILmr + -DLmr + -DLmw. + +For these reasons, Cachegrind is an excellent complement to time-based profilers. -These statistics are presented for the entire program and for each -function in the program. You can also annotate each line of source code in -the program with the counts that were caused directly by it. - -On a modern machine, an L1 miss will typically cost -around 10 cycles, an LL miss can cost as much as 200 -cycles, and a mispredicted branch costs in the region of 10 -to 30 cycles. Detailed cache and branch profiling can be very useful -for understanding how your program interacts with the machine and thus how -to make it faster. + +Cachegrind can annotate programs written in any language, so long as debug info +is present to map machine code back to the original source code. Cachegrind has +been used successfully on programs written in C, C++, Rust, and assembly. + -Also, since one instruction cache read is performed per -instruction executed, you can find out how many instructions are -executed per line, which can be useful for traditional profiling. + +Cachegrind can also simulate how your program interacts with a machine's cache +hierarchy and branch predictor. This simulation was the original motivation for +the tool, hence its name. However, the simulations are basic and unlikely to +reflect the behaviour of a modern machine. For this reason they are off by +default. If you really want cache and branch information, a profiler like +perf that accesses hardware counters is a +better choice. + - -Using Cachegrind, cg_annotate and cg_merge + xreflabel="Using Cachegrind and cg_annotate"> +Using Cachegrind and cg_annotate + + +First, as for normal Valgrind use, you should compile with debugging info (the + option in most compilers). 
But by contrast with normal +Valgrind use, you probably do want to turn optimisation on, since you should +profile your program as it will be normally run. + -First off, as for normal Valgrind use, you probably want to -compile with debugging info (the - option). But by contrast with -normal Valgrind use, you probably do want to turn -optimisation on, since you should profile your program as it will -be normally run. + +Second, run Cachegrind itself to gather the profiling data. + -Then, you need to run Cachegrind itself to gather the profiling -information, and then run cg_annotate to get a detailed presentation of that -information. As an optional intermediate step, you can use cg_merge to sum -together the outputs of multiple Cachegrind runs into a single file which -you then use as the input for cg_annotate. Alternatively, you can use -cg_diff to difference the outputs of two Cachegrind runs into a single file -which you then use as the input for cg_annotate. + +Third, run cg_annotate to get a detailed presentation of that data. cg_annotate +can combine the results of multiple Cachegrind output files. It can also +perform a diff between two Cachegrind output files. + Running Cachegrind -To run Cachegrind on a program prog, run: + +To run Cachegrind on a program prog, run: + -The program will execute (slowly). Upon completion, -summary statistics that look like this will be printed: + +The program will execute (slowly). Upon completion, summary statistics that +look like this will be printed: + - -Cache accesses for instruction fetches are summarised -first, giving the number of fetches made (this is the number of -instructions executed, which can be useful to know in its own -right), the number of I1 misses, and the number of LL instruction -(LLi) misses. - -Cache accesses for data follow. 
The information is similar -to that of the instruction fetches, except that the values are -also shown split between reads and writes (note each row's -rd and -wr values add up to the row's -total). - -Combined instruction and data figures for the LL cache -follow that. Note that the LL miss rate is computed relative to the total -number of memory accesses, not the number of L1 misses. I.e. it is -(ILmr + DLmr + DLmw) / (Ir + Dr + Dw) -not -(ILmr + DLmr + DLmw) / (I1mr + D1mr + D1mw) - - -Branch prediction statistics are not collected by default. -To do so, add the option . +==17942== I refs: 8,195,070 +]]> + + +The I refs number is short for "Instruction +cache references", which is equivalent to "instructions executed". If you +enable the cache and/or branch simulation, additional counts will be shown. + @@ -173,691 +123,791 @@ To do so, add the option . Output File -As well as printing summary information, Cachegrind also writes -more detailed profiling information to a file. By default this file is named -cachegrind.out.<pid> (where -<pid> is the program's process ID), but its name -can be changed with the option. This -file is human-readable, but is intended to be interpreted by the -accompanying program cg_annotate, described in the next section. - -The default .<pid> suffix -on the output file name serves two purposes. Firstly, it means you -don't have to rename old log files that you don't want to overwrite. -Secondly, and more importantly, it allows correct profiling with the - option of -programs that spawn child processes. + +Cachegrind also writes more detailed profiling data to a file. By default this +Cachegrind output file is named cachegrind.out.<pid> +(where <pid> is the program's process ID), but its +name can be changed with the option. +This file is human-readable, but is intended to be interpreted by the +accompanying program cg_annotate, described in the next section. 
+ -The output file can be big, many megabytes for large applications -built with full debugging information. + +The default .<pid> suffix on the output +file name serves two purposes. First, it means existing Cachegrind output files +aren't immediately overwritten. Second, and more importantly, it allows correct +profiling with the option of programs +that spawn child processes. + - Running cg_annotate -Before using cg_annotate, -it is worth widening your window to be at least 120-characters -wide if possible, as the output lines can be quite long. - -To get a function-by-function summary, run: + +Before using cg_annotate, it is worth widening your window to be at least 120 +characters wide if possible, because the output lines can be quite long. + + +Then run: cg_annotate <filename> - -on a Cachegrind output file. +on a Cachegrind output file. + + - -The Output Preamble + +The Metadata Section -The first part of the output looks like this: + +The first part of the output looks like this: + - -This is a summary of the annotation options: + +It summarizes how Cachegrind and the profiled program were run. + - - I1 cache, D1 cache, LL cache: cache configuration. So - you know the configuration with which these results were - obtained. + + Invocation: the command line used to produce this output. + - Command: the command line invocation of the program - under examination. + + Command: the command line used to run the profiled program. + - Events recorded: which events were recorded. - - - - - Events shown: the events shown, which is a subset of the events - gathered. This can be adjusted with the - option. + + Events recorded: which events were recorded. By default, this is + Ir. More events will be recorded if cache + and/or branch simulation is enabled. + - Event sort order: the sort order in which functions are - shown. For example, in this case the functions are sorted - from highest Ir counts to - lowest. 
If two functions have identical - Ir counts, they will then be - sorted by I1mr counts, and - so on. This order can be adjusted with the - option. - - Note that this dictates the order the functions appear. - It is not the order in which the columns - appear; that is dictated by the "events shown" line (and can - be changed with the - option). + + Events shown: the events shown, which is a subset of the events gathered. + This can be adjusted with the option. + - Threshold: cg_annotate - by default omits functions that cause very low counts - to avoid drowning you in information. In this case, - cg_annotate shows summaries the functions that account for - 99% of the Ir counts; - Ir is chosen as the - threshold event since it is the primary sort event. The - threshold can be adjusted with the - - option. + + Event sort order: the sort order used for the subsequent sections. For + example, in this case those sections are sorted from highest + Ir counts to lowest. If there are multiple + events, one will be the primary sort event, and then there can be a + secondary sort event, tertiary sort event, etc., though more than one is + rarely needed. This order can be adjusted with the + option. Note that this does not specify the order in + which the columns appear. That is specified by the "events shown" line (and + can be changed with the option). + - Chosen for annotation: names of files specified - manually for annotation; in this case none. + + Threshold: cg_annotate by default omits files and functions with very low + counts to keep the output size reasonable. By default cg_annotate only + shows files and functions that account for at least 0.1% of the primary + sort event. The threshold can be adjusted with the + option. + - Auto-annotation: whether auto-annotation was requested - via the - option. In this case no. + + Annotation: whether source file annotation is enabled. Controlled with the + option. 
+
+
+If cache simulation is enabled, details of the cache parameters will be shown
+above the "Invocation" line.
+
+
-The Global and Function-level Counts
+ xreflabel="Global, File, and Function-level Counts">
+Global, File, and Function-level Counts
-Then follows summary statistics for the whole
-program:
+
+Next comes the summary for the whole program:
+
+
+
+The Ir column label is suffixed with
+underscores to show the bounds of the columns underneath.
+
+
+
+Then come the file:function counts. Here is the first part of that section:
+
+
+
+  Ir______________________ file:function
+
+< 3,078,746 (37.6%, 37.6%)  /home/njn/grind/ws1/cachegrind/concord.c:
+  1,630,232 (19.9%)           get_word
+    630,918  (7.7%)           hash
+    461,095  (5.6%)           insert
+    130,560  (1.6%)           add_existing
+     91,014  (1.1%)           init_hash_table
+     88,056  (1.1%)           create
+     46,676  (0.6%)           new_word_node
+
+< 1,746,038 (21.3%, 58.9%)  ./malloc/./malloc/malloc.c:
+  1,285,938 (15.7%)           _int_malloc
+    458,225  (5.6%)           malloc
+
+< 1,107,550 (13.5%, 72.4%)  ./libio/./libio/getc.c:getc
+
+<   551,071  (6.7%, 79.1%)  ./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
+
+<   521,228  (6.4%, 85.5%)  ./ctype/../include/ctype.h:
+    260,616  (3.2%)           __ctype_tolower_loc
+    260,612  (3.2%)           __ctype_b_loc
+
+<   468,163  (5.7%, 91.2%)  ???:
+    468,151  (5.7%)           ???
+
+<   456,071  (5.6%, 96.8%)  /usr/include/ctype.h:get_word
+
+]]>
+
+
+Each entry covers one file, and one or more functions within that file. If
+there is only one significant function within a file, as in the third entry,
+the file and function are shown on the same line separated by a colon. If there
+are multiple significant functions within a file, as in the first entry, each
+function gets its own line.
+
+
+
+This example involves a small C program, and shows a combination of code from
+the program itself (including functions like get_word and
+hash in the file concord.c) as well
+as code from system libraries, such as functions like
+malloc and getc.
+
+
+Each entry is preceded with a <, which can
+be useful when navigating through the output in an editor, or grepping through
+results.
+
-These are similar to the summary provided when Cachegrind finishes running.
+The first percentage in each column indicates the proportion of the total event
+count that is covered by this line. The second percentage, which only shows on the
+first line of each entry, shows the cumulative percentage of all the entries up
+to and including this one. The entries shown here account for 96.8% of the
+instructions executed by the program.
-Then comes function-by-function statistics:
+
+The name ??? is used if the file name and/or
+function name could not be determined from debugging information. If
+??? filenames dominate, the program probably wasn't
+compiled with . If ??? function names
+dominate, the program may have had symbols stripped.
+
+
+
+After that come the function:file counts. Here is the first part of that section:
+
-
-Each function
-is identified by a
-file_name:function_name pair. If
-a column contains only a dot it means the function never performs
-that event (e.g. the third row shows that
-strcmp() contains no
-instructions that write to memory). The name
-??? is used if the file name
-and/or function name could not be determined from debugging
-information. If most of the entries have the form
-???:??? the program probably
-wasn't compiled with .
-
-It is worth noting that functions will come both from
-the profiled program (e.g. concord.c)
-and from libraries (e.g. getc.c)
+  Ir______________________ function:file
+
+> 2,086,303 (25.5%, 25.5%)  get_word:
+  1,630,232 (19.9%)           /home/njn/grind/ws1/cachegrind/concord.c
+    456,071  (5.6%)           /usr/include/ctype.h
+
+> 1,285,938 (15.7%, 41.1%)  _int_malloc:./malloc/./malloc/malloc.c
+
+> 1,107,550 (13.5%, 54.7%)  getc:./libio/./libio/getc.c
+
+>   630,918  (7.7%, 62.4%)  hash:/home/njn/grind/ws1/cachegrind/concord.c
+
+>   551,071  (6.7%, 69.1%)  __strcmp_avx2:./string/../sysdeps/x86_64/multiarch/strcmp-avx2.S
+
+>   480,248  (5.9%, 74.9%)  malloc:
+    458,225  (5.6%)           ./malloc/./malloc/malloc.c
+     22,023  (0.3%)           ./malloc/./malloc/arena.c
+
+>   468,151  (5.7%, 80.7%)  ???:???
+
+>   461,095  (5.6%, 86.3%)  insert:/home/njn/grind/ws1/cachegrind/concord.c
+]]>
+
+
+This is similar to the previous section, but is grouped by functions first and
+files second. Also, the entry markers are >
+instead of <.
+
+
+
+You might wonder why this section is needed, and how it differs from the
+previous section. The answer is inlining. In this example there are two entries
+demonstrating a function whose code is effectively spread across more than one
+file: get_word and malloc. Here is an
+example from profiling the Rust compiler, a much larger program that uses
+inlining more:
+
+
+  30,469,230 (1.3%, 11.1%)  ::intern_ty:
+  10,269,220 (0.5%)           /home/njn/.cargo/registry/src/github.com-1ecc6299db9ec823/hashbrown-0.12.3/src/raw/mod.rs
+   7,696,827 (0.3%)           /home/njn/dev/rust0/compiler/rustc_middle/src/ty/context.rs
+   3,858,099 (0.2%)           /home/njn/dev/rust0/library/core/src/cell.rs
+]]>
+
+
+In this case the compiled function intern_ty includes code
+from three different source files, due to inlining. These should be examined
+together. Older versions of cg_annotate presented this entry as three separate
+file:function entries, which would typically be intermixed with all the other
+entries, making it hard to see that they are all really part of the same
+function.
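The relationship between the file:function and function:file sections can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not how cg_annotate is implemented: the `counts` dictionary and the `group` helper are hypothetical, the file paths are shortened, and the figures are copied from the get_word, malloc and _int_malloc rows in the listings above.

```python
from collections import defaultdict

# Ir counts keyed by (file, function), copied from the example output above.
# Paths are shortened for readability; the dictionary itself is hypothetical.
counts = {
    ("concord.c", "get_word"): 1_630_232,
    ("/usr/include/ctype.h", "get_word"): 456_071,
    ("malloc.c", "malloc"): 458_225,
    ("arena.c", "malloc"): 22_023,
    ("malloc.c", "_int_malloc"): 1_285_938,
}


def group(counts, outer_index):
    """Group counts by one element of the (file, function) key.

    outer_index=0 groups by file (the file:function view);
    outer_index=1 groups by function (the function:file view).
    """
    grouped = defaultdict(dict)
    for key, n in counts.items():
        outer, inner = key[outer_index], key[1 - outer_index]
        grouped[outer][inner] = n
    # Sort entries by their total count, largest first.
    return sorted(grouped.items(), key=lambda kv: sum(kv[1].values()), reverse=True)


by_function = group(counts, outer_index=1)
for name, files in by_function:
    print(f"> {sum(files.values()):>9,}  {name}")
    for path, n in sorted(files.items(), key=lambda kv: kv[1], reverse=True):
        print(f"  {n:>9,}    {path}")
```

Grouping by function brings the two get_word rows together, so their sum matches the 2,086,303 shown for get_word in the function:file listing, which the file:function view alone does not make obvious.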
+
-
-Line-by-line Counts
+
+Per-line Counts
+
+
+By default, a source file is annotated if it contains at least one function
+that meets the significance threshold. This can be disabled with the
+ option.
+
-By default, all source code annotation is also shown. (Filenames to be
-annotated can also by specified manually as arguments to cg_annotate, but this
-is rarely needed.) For example, the output from running cg_annotate
-<filename> for our example produces the same output as above
-followed by an annotated version of concord.c, a section
-of which looks like:
+
+To continue the previous example, here is part of the annotation of the file
+concord.c:
+
;word, data->line, table);
- . . . . . . . . .
- 4 0 0 1 0 0 2 0 0 free(data);
- 4 0 0 1 0 0 2 0 0 fclose(file_ptr);
- 3 0 0 2 0 0 . . . }]]>
-
-(Although column widths are automatically minimised, a wide
-terminal is clearly useful.)
-
-Each source file is clearly marked
-(User-annotated source) as
-having been chosen manually for annotation. If the file was
-found in one of the directories specified with the
-/ option, the directory
-and file are both given.
-
-Each line is annotated with its event counts. Events not
-applicable for a line are represented by a dot. This is useful
-for distinguishing between an event which cannot happen, and one
-which can but did not.
-
-Sometimes only a small section of a source file is
-executed. To minimise uninteresting output, Cachegrind only shows
-annotated lines and lines within a small distance of annotated
-lines. Gaps are marked with the line numbers so you know which
-part of a file the shown code comes from, eg:
+Ir____________
+
+      .         /* Function builds the hash table from the given file. */
+      .         void init_hash_table(char *file_name, Word_Node *table[])
+      8 (0.0%)  {
+      .             FILE *file_ptr;
+      .             Word_Info *data;
+      2 (0.0%)      int line = 1, i;
+      .
+      .             /* Structure used when reading in words and line numbers. */
+      3 (0.0%)      data = (Word_Info *) create(sizeof(Word_Info));
+      .
+      .             /* Initialise entire table to NULL. */
+  2,993 (0.0%)      for (i = 0; i < TABLE_SIZE; i++)
+    997 (0.0%)          table[i] = NULL;
+      .
+      .             /* Open file, check it. */
+      4 (0.0%)      file_ptr = fopen(file_name, "r");
+      2 (0.0%)      if (!(file_ptr)) {
+      .                 fprintf(stderr, "Couldn't open '%s'.\n", file_name);
+      .                 exit(EXIT_FAILURE);
+      .             }
+      .
+      .             /* 'Get' the words and lines one at a time from the file, and insert them
+      .             ** into the table one at a time. */
+ 55,363 (0.7%)      while ((line = get_word(data, line, file_ptr)) != EOF)
+ 31,632 (0.4%)          insert(data->word, data->line, table);
+      .
+      2 (0.0%)      free(data);
+      2 (0.0%)      fclose(file_ptr);
+      6 (0.0%)  }
+]]>
+
+
+Each executed line is annotated with its event counts. Other lines are
+annotated with a dot. This may be because they contain no executable code, or
+they contain executable code but were never executed.
+
+
+
+You can easily tell if a function is inlined from this output. If it is not
+inlined, it will have event counts on the lines containing the opening and
+closing braces. If it is inlined, it will not have event counts on those lines.
+In the example above, init_hash_table does have counts,
+so you can tell it is not inlined.
+
+
+
+Note again that inlining can lead to surprising results. If a function
+f is always inlined, in the file:function and
+function:file sections counts will be attributed to the functions it is inlined
+into, rather than itself. However, if you look at the line-by-line annotations
+for f you'll see the counts that belong to
+f. So it's worth looking for large counts/percentages in the
+line-by-line annotations.
+
+
+
+Sometimes only a small section of a source file is executed. To minimise
+uninteresting output, Cachegrind only shows annotated lines and lines within a
+small distance of annotated lines. Gaps are marked with line numbers, for
+example:
+
-
-The amount of context to show around annotated lines is
-controlled by the
-option.
-
-Automatic annotation is enabled by default.
-cg_annotate will automatically annotate every source file it can
-find that is mentioned in the function-by-function summary.
-Therefore, the files chosen for auto-annotation are affected by
-the and
- options. Each
-source file is clearly marked (Auto-annotated
-source) as being chosen automatically. Any
-files that could not be found are mentioned at the end of the
-output, eg:
+(counts and code for line 704)
+-- line 375 ----------------------------------------
+-- line 514 ----------------------------------------
+(counts and code for line 878)
+]]>
+
+
+The number of lines of context shown around annotated lines is controlled by
+the option.
+
+
+
+Any significant source files that could not be found are shown like this:
+
-
-This is quite common for library files, since libraries are
-usually compiled with debugging information, but the source files
-are often not present on a system. If a file is chosen for
-annotation both manually and automatically, it
-is marked as User-annotated
-source. Use the
-/ option to tell Valgrind where
-to look for source files if the filenames found from the debugging
-information aren't specific enough.
-
- Beware that auto-annotation can produce a lot of output if your program
-is large.
+--------------------------------------------------------------------------------
+-- Annotated source file: ./malloc/./malloc/malloc.c
+--------------------------------------------------------------------------------
+Unannotated because one or more of these original files are unreadable:
+- ./malloc/./malloc/malloc.c
+]]>
-
+
+This is common for library files, because libraries are usually compiled with
+debugging information but the source files are rarely present on a system.
+
+
+
+Cachegrind relies heavily on accurate debug info. Sometimes compilers
+map a particular compiled instruction to line number 0, where the 0 represents
+"unknown" or "none". This is annoying but does happen in practice.
cg_annotate +prints these in the following way: + + -Annotating Assembly Code Programs +1,046,746 (0.0%) +]]> -Valgrind can annotate assembly code programs too, or annotate -the assembly code generated for your C program. Sometimes this is -useful for understanding what is really happening when an -interesting line of C code is translated into multiple -instructions. + +Finally, when annotation is performed, the output ends with a summary of how +many counts were annotated and unannotated, and why. For example: + -To do this, you just need to assemble your -.s files with assembly-level debug -information. You can use compile with the to compile C/C++ -programs to assembly code, and then assemble the assembly code files with - to achieve this. You can then profile and annotate the -assembly code source files in the same way as C/C++ source files. + + Forking Programs -If your program forks, the child will inherit all the profiling data that -has been gathered for the parent. - -If the output file format string (controlled by -) does not contain , -then the outputs from the parent and child will be intermingled in a single -output file, which will almost certainly make it unreadable by -cg_annotate. + + +If your program forks, the child will inherit all the profiling data that +has been gathered for the parent. + + + +If the output file name (controlled by ) +does not contain , then the outputs from the parent and +child will be intermingled in a single output file, which will almost certainly +make it unreadable by cg_annotate. + + cg_annotate Warnings -There are a couple of situations in which -cg_annotate issues warnings. + +There are two situations in which cg_annotate prints warnings. + - If a source file is more recent than the - cachegrind.out.<pid> file. - This is because the information in - cachegrind.out.<pid> is only - recorded with line numbers, so if the line numbers change at - all in the source (e.g. 
lines added, deleted, swapped), any - annotations will be incorrect. + + If a source file is more recent than the Cachegrind output file. This is + because the information in the Cachegrind output file is only recorded with + line numbers, so if the line numbers change at all in the source (e.g. + lines added, deleted, swapped), any annotations will be incorrect. + - If information is recorded about line numbers past the - end of a file. This can be caused by the above problem, - i.e. shortening the source file while using an old - cachegrind.out.<pid> file. If - this happens, the figures for the bogus lines are printed - anyway (clearly marked as bogus) in case they are - important. + + If information is recorded about line numbers past the end of a file. This + can be caused by the above problem, e.g. shortening the source file while + using an old Cachegrind output file. If this happens, the figures for the + bogus lines are printed anyway (and clearly marked as bogus) in case they + are important. + + +Merging Cachegrind Output Files - -Unusual Annotation Cases + +cg_annotate can merge data from multiple Cachegrind output files in a single +run. (There is also a program called cg_merge that can merge multiple +Cachegrind output files into a single Cachegrind output file, but it is now +deprecated because cg_annotate's merging does a better job.) + -Some odd things that can occur during annotation: + +Use it as follows: + - - - If annotating at the assembler level, you might see - something like this: - - How can the third instruction be executed twice when - the others are executed only once? As it turns out, it - isn't. Here's a dump of the executable, using - objdump -d: - - - Notice the extra mov - %esi,%esi instruction. Where did this come - from? The GNU assembler inserted it to serve as the two - bytes of padding needed to align the movl - $.LnrB,%eax instruction on a four-byte - boundary, but pretended it didn't exist when adding debug - information. 
Thus when Valgrind reads the debug info it - thinks that the movl - $0x1,0xffffffec(%ebp) instruction covers the - address range 0x8048f2b--0x804833 by itself, and attributes - the counts for the mov - %esi,%esi to it. - - - - - - Sometimes, the same filename might be represented with - a relative name and with an absolute name in different parts - of the debug info, eg: - /home/user/proj/proj.h and - ../proj.h. In this case, if you use - auto-annotation, the file will be annotated twice with the - counts split between the two. - - - - If you compile some files with - and some without, some - events that take place in a file without debug info could be - attributed to the last line of a file with debug info - (whichever one gets placed before the non-debug-info file in - the executable). - +cg_annotate file1 file2 file3 ... +]]> - + +cg_annotate computes the sum of these files (effectively +file1 + file2 + +file3), and then produces output as usual that shows the +summed counts. + -These cases should be rare. + +The most common merging scenario is if you want to aggregate costs over +multiple runs of the same program, possibly on different inputs. + - -Merging Profiles with cg_merge + +Differencing Cachegrind output files -cg_merge is a simple program which -reads multiple profile files, as created by Cachegrind, merges them -together, and writes the results into another file in the same format. -You can then examine the merged results using -cg_annotate <filename>, as -described above. The merging functionality might be useful if you -want to aggregate costs over multiple runs of the same program, or -from a single parallel run with multiple instances of the same -program. +cg_annotate can diff data from two Cachegrind output files in a single run. +(There is also a program called cg_diff that can diff two Cachegrind output +files into a single Cachegrind output file, but it is now deprecated because +cg_annotate's differencing does a better job.) 
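Conceptually, merging adds the counts from each file, and differencing subtracts the first file's counts from the second's (file2 - file1). The arithmetic can be sketched in Python as below. This is a minimal illustration, not cg_annotate's implementation: it assumes each output file has already been reduced to a hypothetical dictionary of per-function Ir counts, and the `merge` and `diff` helpers, the keys, and the numbers are all made up for the example.

```python
def merge(*profiles):
    """Sum counts key by key across any number of profiles."""
    total = {}
    for profile in profiles:
        for key, n in profile.items():
            total[key] = total.get(key, 0) + n
    return total


def diff(first, second):
    """Return second - first; a negative value means the count went down."""
    keys = first.keys() | second.keys()
    return {k: second.get(k, 0) - first.get(k, 0) for k in keys}


# Hypothetical per-function Ir counts from two runs.
run1 = {"prog.c:f": 1000, "prog.c:g": 250}
run2 = {"prog.c:f": 900, "prog.c:h": 40}

print(merge(run1, run2))
print(diff(run1, run2))
```

A key that appears in only one profile is treated as zero in the other, which is why a function present only in the first run shows up with a negative difference.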
+ -cg_merge is invoked as follows: +Use it as follows: +cg_annotate --diff file1 file2 +]]> -It reads and checks file1, then read -and checks file2 and merges it into -the running totals, then the same with -file3, etc. The final results are -written to outputfile, or to standard -out if no output file is specified. +cg_annotate computes the difference between these two files (effectively +file2 - file1), and then +produces output as usual that shows the count differences. Note that many of +the counts may be negative; this indicates that the counts for the relevant +file/function/line are smaller in the second version than those in the first +version. + -Costs are summed on a per-function, per-line and per-instruction -basis. Because of this, the order in which the input files does not -matter, although you should take care to only mention each file once, -since any file mentioned twice will be added in twice. +The simplest common scenario is comparing two Cachegrind output files that came +from the same program, but on different inputs. cg_annotate will do a good job +on this without assistance. + -cg_merge does not attempt to check -that the input files come from runs of the same executable. It will -happily merge together profile files from completely unrelated -programs. It does however check that the -Events: lines of all the inputs are -identical, so as to ensure that the addition of costs makes sense. -For example, it would be nonsensical for it to add a number indicating -D1 read references to a number from a different file indicating LL -write misses. +A more complex scenario is if you want to compare Cachegrind output files from +two slightly different versions of a program that you have sitting +side-by-side, running on the same input. For example, you might have +version1/prog.c and version2/prog.c. +A straight comparison of the two would not be useful. 
Because functions are +always paired with filenames, a function f would be listed +as version1/prog.c:f for the first version but +version2/prog.c:f for the second version. + -A number of other syntax and sanity checks are done whilst reading the -inputs. cg_merge will stop and -attempt to print a helpful error message if any of the input files -fail these checks. - - - - - -Differencing Profiles with cg_diff +In this case, use the option. Its argument is a +search-and-replace expression that will be applied to all the filenames in both +Cachegrind output files. It can be used to remove minor differences in +filenames. For example, the option + will suffice for the +above example. + -cg_diff is a simple program which -reads two profile files, as created by Cachegrind, finds the difference -between them, and writes the results into another file in the same format. -You can then examine the merged results using -cg_annotate <filename>, as -described above. This is very useful if you want to measure how a change to -a program affected its performance. +Similarly, sometimes compilers auto-generate certain functions and give them +randomized names like T.1234 where the suffixes vary from +build to build. You can use the option to +remove small differences like these; it works in the same way as +. -cg_diff is invoked as follows: +When is used to compare two different versions +of the same program, cg_annotate will not annotate any file that is different +between the two versions, because the per-line counts are not reliable in such +a case. For example, imagine if version2/prog.c is the +same as version1/prog.c except with an extra blank line at +the top of the file. Every single per-line count will have changed. In +comparison, the per-file and per-function counts have not changed, and are +still very useful for determining differences between programs. 
You might think +that this means every interesting file will be left unannotated, but again +inlining means that files that are identical in the two versions can have +different counts on many lines. - - -It reads and checks file1, then read -and checks file2, then computes the -difference (effectively file1 - -file2). The final results are written to -standard output. + - -Costs are summed on a per-function basis. Per-line costs are not summed, -because doing so is too difficult. For example, consider differencing two -profiles, one from a single-file program A, and one from the same program A -where a single blank line was inserted at the top of the file. Every single -per-line count has changed. In comparison, the per-function counts have not -changed. The per-function count differences are still very useful for -determining differences between programs. Note that because the result is -the difference of two profiles, many of the counts will be negative; this -indicates that the counts for the relevant function are fewer in the second -version than those in the first version. + +Cache and Branch Simulation -cg_diff does not attempt to check -that the input files come from runs of the same executable. It will -happily merge together profile files from completely unrelated -programs. It does however check that the -Events: lines of all the inputs are -identical, so as to ensure that the addition of costs makes sense. -For example, it would be nonsensical for it to add a number indicating -D1 read references to a number from a different file indicating LL -write misses. +Cachegrind can simulate how your program interacts with a machine's cache +hierarchy and/or branch predictor. + +The cache simulation models a machine with independent first-level instruction +and data caches (I1 and D1), backed by a unified second-level cache (L2). 
For +these machines (in the cases where Cachegrind can auto-detect the cache +configuration) Cachegrind simulates the first-level and last-level caches. +Therefore, Cachegrind always refers to the I1, D1 and LL (last-level) caches. + -A number of other syntax and sanity checks are done whilst reading the -inputs. cg_diff will stop and -attempt to print a helpful error message if any of the input files -fail these checks. +When simulating the cache, with , Cachegrind +gathers the following statistics: + + + + + + I cache reads (Ir, which equals the number + of instructions executed), I1 cache read misses + (I1mr) and LL cache instruction read + misses (ILmr). + + + + + D cache reads (Dr, which equals the number + of memory reads), D1 cache read misses + (D1mr), and LL cache data read misses + (DLmr). + + + + + D cache writes (Dw, which equals the + number of memory writes), D1 cache write misses + (D1mw), and LL cache data write misses + (DLmw). + + + -Sometimes you will want to compare Cachegrind profiles of two versions of a -program that you have sitting side-by-side. For example, you might have -version1/prog.c and -version2/prog.c, where the second is -slightly different to the first. A straight comparison of the two will not -be useful -- because functions are qualified with filenames, a function -f will be listed as -version1/prog.c:f for the first version but -version2/prog.c:f for the second -version. +Note that D1 total accesses is given by D1mr + +D1mw, and that LL total accesses is given by +ILmr + DLmr + +DLmw. + -When this happens, you can use the option. -Its argument is a Perl search-and-replace expression that will be applied -to all the filenames in both Cachegrind output files. It can be used to -remove minor differences in filenames. For example, the option - will suffice for -this case. 
+When simulating the branch predictor, with , +Cachegrind gathers the following statistics: + + + + + + Conditional branches executed (Bc) and + conditional branches mispredicted (Bcm). + + + + + Indirect branches executed (Bi) and + indirect branches mispredicted (Bim). + + + -Similarly, sometimes compilers auto-generate certain functions and give them -randomized names. For example, GCC sometimes auto-generates functions with -names like T.1234, and the suffixes vary from build to -build. You can use the option to remove -small differences like these; it works in the same way as -. +When cache and/or branch simulation is enabled, cg_annotate will print multiple +counts per line of output. For example: + - + 8,547 (0.1%, 99.4%) 936 (0.1%, 99.1%) 177 (0.3%, 96.7%) 59 (0.0%, 99.9%) 38 (19.4%, 66.3%) strcmp: + 8,503 (0.1%) 928 (0.1%) 175 (0.3%) 59 (0.0%) 38 (19.4%) ./string/../sysdeps/x86_64/multiarch/../multiarch/strcmp-sse2.S +]]> - + + Cachegrind Command-line Options -Cachegrind-specific options are: + +Cachegrind-specific options are: + - + - + - Specify the size, associativity and line size of the level 1 - instruction cache. + + Write the Cachegrind output file to file rather than + to the default output file, + cachegrind.out.<pid>. The + and format specifiers can be used to embed the + process ID and/or the contents of an environment variable in the name, as + is the case for the core option + . + - + - + - Specify the size, associativity and line size of the level 1 - data cache. + + Enables or disables collection of cache access and miss counts. + - + - + - Specify the size, associativity and line size of the last-level - cache. + + Enables or disables collection of branch instruction and + misprediction counts. + - + - + - Enables or disables collection of cache access and miss - counts. + + Specify the size, associativity and line size of the level 1 instruction + cache. Only useful with . 
+ - + - + - Enables or disables collection of branch instruction and - misprediction counts. By default this is disabled as it - slows Cachegrind down by approximately 25%. Note that you - cannot specify - and - together, as that would leave Cachegrind with no - information to collect. + + Specify the size, associativity and line size of the level 1 data cache. + Only useful with . + - + - + - Write the profile data to - file rather than to the default - output file, - cachegrind.out.<pid>. The - and format specifiers - can be used to embed the process ID and/or the contents of an - environment variable in the name, as is the case for the core - option . + + Specify the size, associativity and line size of the last-level cache. + Only useful with . @@ -895,94 +945,114 @@ small differences like these; it works in the same way as - + - Specifies which events to show (and the column - order). Default is to use all present in the - cachegrind.out.<pid> file (and - use the order in the file). Useful if you want to concentrate on, for - example, I cache misses (), or data - read misses (), or LL data misses - (). Best used in conjunction with - . + Diff two Cachegrind output files. - + - Specifies the events upon which the sorting of the - function-by-function entries will be based. + + Specifies an search-and-replace expression + that is applied to all filenames. Useful when differencing, for removing + minor differences in paths between two different versions of a program + that are sitting in different directories. An suffix + makes the regex case-insensitive, and a suffix makes + it match multiple times. + - + - Sets the threshold for the function-by-function - summary. A function is shown if it accounts for more than X% - of the counts for the primary sort event. If auto-annotating, also - affects which files are annotated. 
- - Note: thresholds can be set for more than one of the - events by appending any events for the - option with a colon - and a number (no spaces, though). E.g. if you want to see - each function that covers more than 1% of LL read misses or 1% of LL - write misses, use this option: - + + Like , but for filenames. Useful for + removing minor differences in randomized names of auto-generated + functions generated by some compilers. + - + - When enabled, a percentage is printed next to all event counts. - This helps gauge the relative importance of each function and line. + + Specifies which events to show (and the column order). Default is to use + all present in the Cachegrind output file (and use the order in the + file). Best used in conjunction with . - + - When enabled, automatically annotates every file that - is mentioned in the function-by-function summary that can be - found. Also gives a list of those that couldn't be found. + + Specifies the events upon which the sorting of the file:function and + function:file entries will be based. + - + + + + + Sets the significance threshold for the file:function and function:files + sections. A file or function is shown if it accounts for more than X% of + the counts for the primary sort event. If annotating source files, this + also affects which files are annotated. + + + + + + + + + + + When enabled, a percentage is printed next to all event counts. This + helps gauge the relative importance of each function and line. + + + + + + + - Print N lines of context before and after each - annotated line. Avoids printing large sections of source - files that were not executed. Use a large number - (e.g. 100000) to show all source lines. + + Enables or disables source file annotation. + -