]>
Commit | Line | Data |
---|---|---|
7a515757 VM |
1 | \input texinfo @c -*-texinfo-*- |
2 | ||
3 | @c ---------------------------------------------------------------------------- | |
4 | @c This is the Texinfo source file for the GPROFNG manual. This manual | |
5 | @c includes the man pages for the various tools. | |
6 | @c | |
7 | @c Author: Ruud van der Pas | |
8 | @c ---------------------------------------------------------------------------- | |
9 | ||
10 | @c %**start of header | |
11 | ||
12 | @setfilename gprofng.info | |
13 | @settitle GNU gprofng | |
14 | ||
15 | @c -- Set the indent for the @example command to 1 space, not 5 --------------- | |
16 | @exampleindent 1 | |
17 | ||
18 | @paragraphindent 3 | |
19 | ||
20 | @c %**end of header | |
21 | ||
22 | @c -- Start a new chapter on a new, odd numbered, page ------------------------ | |
23 | @setchapternewpage odd | |
24 | ||
25 | @c -- Merge all index entries into the Concepts Index ------------------------- | |
26 | @syncodeindex fn cp | |
27 | @syncodeindex ky cp | |
28 | @syncodeindex pg cp | |
29 | @syncodeindex vr cp | |
30 | ||
31 | @c -- Macros specific to gprofng ---------------------------------------------- | |
32 | @include gp-macros.texi | |
33 | ||
34 | @c -- Get the version information --------------------------------------------- | |
35 | @include version.texi | |
36 | ||
37 | @c -- Entry for the Info dir structure ---------------------------------------- | |
38 | @ifnottex | |
39 | @dircategory Software development | |
40 | @direntry | |
41 | * gprofng: (gprofng). The next generation profiling tool for Linux | |
42 | @end direntry | |
43 | @end ifnottex | |
44 | ||
45 | @c -- Copyright stuff --------------------------------------------------------- | |
46 | @copying | |
47 | This document is the manual for @ProductName{}, last updated @value{UPDATED}. | |
48 | ||
fd67aa11 | 49 | Copyright @copyright{} 2022-2024 Free Software Foundation, Inc. |
7a515757 VM |
50 | |
51 | @c -- @quotation | |
52 | Permission is granted to copy, distribute and/or modify this document | |
53 | under the terms of the GNU Free Documentation License, | |
54 | Version 1.3 or any later version published by the Free Software | |
55 | Foundation; with no Invariant Sections, with no Front-Cover texts, | |
56 | and with no Back-Cover Texts. A copy of the license is included in the | |
57 | section entitled ``GNU Free Documentation License.'' | |
58 | ||
59 | @c -- @end quotation | |
60 | @end copying | |
61 | ||
62 | @finalout | |
63 | @smallbook | |
64 | ||
65 | @c -- Define the title page --------------------------------------------------- | |
66 | @titlepage | |
67 | @title GNU gprofng | |
68 | @subtitle The next generation profiling tool for Linux | |
69 | @subtitle version @value{VERSION} (last updated @value{UPDATED}) | |
70 | @author Ruud van der Pas | |
71 | @page | |
72 | @vskip 0pt plus 1filll | |
73 | @insertcopying | |
74 | @end titlepage | |
75 | ||
76 | @c -- Generate the Table of Contents ------------------------------------------ | |
77 | @contents | |
78 | ||
79 | @c -- The Top node ------------------------------------------------------------ | |
80 | @c Should contain a short summary, copying permissions and a master menu. | |
81 | @c ---------------------------------------------------------------------------- | |
82 | @ifnottex | |
83 | @node Top | |
84 | @top GNU Gprofng | |
85 | ||
86 | @insertcopying | |
87 | @end ifnottex | |
88 | ||
89 | @ifinfo | |
90 | @c -- The menu entries -------------------------------------------------------- | |
91 | ||
92 | @c * Display Source Code:: Display the source code and disassembly. | |
93 | @c * Archive Experiment Data:: Archive an experiment. | |
94 | ||
95 | @menu | |
46c56757 VM |
96 | * Introduction:: About this manual. |
97 | * Overview:: A brief overview of @ProductName{}. | |
98 | * A Mini Tutorial:: A short tutorial covering the key features. | |
99 | * The gprofng Tools:: An overview of the tools supported. | |
100 | * Performance Data Collection:: Record the performance information. | |
101 | * View the Performance Information:: Different ways to view the data. | |
102 | * Terminology:: Concepts and terminology explained. | |
103 | * Other Document Formats:: Create this document in other formats. | |
104 | * The gprofng Man Pages:: The gprofng man pages. | |
105 | * Index:: The index. | |
7a515757 VM |
106 | |
107 | @detailmenu | |
108 | ||
109 | --- The Detailed Node Listing --- | |
110 | ||
111 | Introduction | |
112 | ||
113 | Overview | |
114 | ||
115 | * Main Features:: A high level overview. | |
116 | * Sampling versus Tracing:: The pros and cons of sampling versus tracing. | |
117 | * Steps Needed to Create a Profile:: How to create a profile. | |
118 | ||
119 | A Mini Tutorial | |
120 | ||
121 | * Getting Started:: The basics of profiling with @ProductName(). | |
122 | * Support for Multithreading:: Commands specific to multithreaded applications. | |
123 | * View Multiple Experiments:: Analyze multiple experiments simultaneously. | |
124 | * Profile Hardware Event Counters:: How to use hardware event counters. | |
125 | * Java Profiling:: How to profile a Java application. | |
126 | ||
127 | The gprofng Tools | |
128 | ||
129 | * Tools Overview:: A brief description of the tools. | |
130 | * The gprofng.rc file with default settings:: Customize the settings. | |
131 | * Filters:: Filters. | |
132 | * Supported Environment Variables:: The supported environment variables. | |
133 | ||
46c56757 VM |
134 | Performance Data Collection |
135 | ||
136 | * The @command{gprofng collect app} Tool:: Collect application performance data. | |
137 | ||
138 | View the Performance Information | |
139 | ||
140 | * The @command{gprofng display text} Tool:: View the performance data in plain text. | |
141 | ||
7a515757 VM |
142 | Terminology |
143 | ||
144 | * The Program Counter:: What is a Program Counter? | |
145 | * Inclusive and Exclusive Metrics:: An explanation of inclusive and exclusive metrics. | |
146 | * Metric Definitions:: Definitions associated with metrics. | |
147 | * The Viewmode:: Select the way call stacks are presented. | |
148 | * The Selection List:: How to define a selection. | |
149 | * Load Objects and Functions:: The components in an application. | |
46c56757 | 150 | * The Concept of a CPU in gprofng:: The definition of a CPU. |
7a515757 VM |
151 | * Hardware Event Counters Explained:: What are event counters? |
152 | * apath:: Our generic definition of a path. | |
153 | ||
154 | The gprofng Man Pages | |
155 | ||
46c56757 VM |
156 | * Man page for @command{gprofng collect app}:: The man page for gprofng collect app. |
157 | * Man page for @command{gprofng display text}:: The man page for gprofng display text. | |
158 | * Man page for @command{gprofng display html}:: The man page for gprofng display html. | |
159 | * Man page for @command{gprofng display src}:: The man page for gprofng display src. | |
160 | * Man page for @command{gprofng archive}:: The man page for gprofng archive. | |
7a515757 VM |
161 | |
162 | @c -- Index | |
163 | ||
164 | @end detailmenu | |
165 | @end menu | |
166 | @end ifinfo | |
167 | ||
168 | @c -- A new node -------------------------------------------------------------- | |
169 | @node Introduction | |
170 | @chapter Introduction | |
171 | @c ---------------------------------------------------------------------------- | |
172 | The @ProductName{} tool is the next generation profiler for Linux. It consists | |
173 | of various commands to generate and display profile information. | |
174 | ||
175 | This manual starts with a tutorial how to create and interpret a profile. This | |
176 | part is highly practical and has the goal to get users up to speed as quickly | |
177 | as possible. As soon as possible, we would like to show you how to get your | |
178 | first profile on your screen. | |
179 | ||
180 | This is followed by more examples, covering many of the features. At the | |
181 | end of this tutorial, you should feel confident enough to tackle the more | |
182 | complex tasks. | |
183 | ||
184 | In a future update a more formal reference manual will be included as well. | |
185 | Since even in this tutorial we use certain terminology, we have included a | |
186 | chapter with descriptions at the end. In case you encounter unfamiliar | |
187 | wordings or terminology, please check this chapter. | |
188 | ||
189 | One word of caution. In several cases we had to somewhat tweak the screen | |
190 | output in order to make it fit. This is why the output may look somewhat | |
191 | different when you try things yourself. | |
192 | ||
193 | For now, we wish you a smooth profiling experience with @ProductName{} and | |
194 | good luck tackling performance bottlenecks. | |
195 | ||
196 | @c -- A new node -------------------------------------------------------------- | |
197 | @c cccccc @node A Brief Overview of @ProductName{} | |
198 | @node Overview | |
199 | @chapter A Brief Overview of @ProductName{} | |
200 | @c ---------------------------------------------------------------------------- | |
201 | ||
202 | @menu | |
203 | * Main Features:: A high level overview. | |
204 | * Sampling versus Tracing:: The pros and cons of sampling versus tracing. | |
205 | * Steps Needed to Create a Profile:: How to create a profile. | |
206 | @end menu | |
207 | ||
208 | Before we cover this tool in quite some detail, we start with a brief overview | |
209 | of what it is, and the main features. Since we know that many of you would | |
210 | like to get started rightaway, already in this first chapter we explain the | |
211 | basics of profiling with @ToolName{}. | |
212 | ||
213 | @c ---------------------------------------------------------------------------- | |
214 | @c TBD Review this text. Probably be more specific on the gcc releases and | |
215 | @c processor specifics. | |
216 | @c ---------------------------------------------------------------------------- | |
217 | ||
218 | @c -- A new node -------------------------------------------------------------- | |
219 | @node Main Features | |
220 | @section Main Features | |
221 | @c ---------------------------------------------------------------------------- | |
222 | ||
223 | @noindent | |
224 | These are the main features of the @ProductName{} tool: | |
225 | ||
226 | @itemize @bullet | |
227 | ||
228 | @item | |
229 | Profiling is supported for an application written in C, C++, Java, or Scala. | |
230 | ||
231 | @c TBD Java: up to 1.8 full support, support other than for modules | |
232 | ||
233 | @item | |
234 | Shared libraries are supported. The information is presented at the instruction | |
235 | level. | |
236 | ||
237 | @item | |
238 | The following multithreading programming models are supported: Pthreads, | |
239 | OpenMP, and Java threads. | |
240 | ||
241 | @item | |
242 | This tool works with unmodified production level executables. There is no need to | |
243 | recompile the code, but if the @samp{-g} option has been used when building | |
244 | the application, source line level information is available. | |
245 | ||
246 | @item | |
247 | The focus is on support for code generated with the @command{gcc} compiler, but | |
248 | there is some limited support for the @command{icc} compiler as well. Future | |
249 | improvements and enhancements will focus on @command{gcc} though. | |
250 | ||
251 | @item | |
252 | Processors from Intel, AMD, and Arm are supported, but the level of support | |
253 | depends on the architectural details. In particular, hardware event counters | |
254 | may not be supported. If this is the case, all views not related to these | |
255 | counters still ought to work though. | |
256 | ||
257 | @item | |
258 | Several views into the data are supported. For example, a function overview | |
259 | where the time is spent, but also a source line, disassembly, call tree and | |
260 | a caller-callees overview are available. | |
261 | ||
262 | @item | |
263 | Through filters, the user can zoom in on an area of interest. | |
264 | ||
265 | @item | |
266 | Two or more profiles can be aggregated, or used in a comparison. This comparison | |
267 | can be obtained at the function, source line, and disassembly level. | |
268 | ||
269 | @item | |
270 | Through a simple scripting language, and customization of the metrics shown, | |
271 | the generation and creation of a profile can be fully automated and provide | |
272 | tailored output. | |
273 | ||
274 | @end itemize | |
275 | ||
276 | @c -- A new node -------------------------------------------------------------- | |
277 | @node Sampling versus Tracing | |
278 | @section Sampling versus Tracing | |
279 | @c ---------------------------------------------------------------------------- | |
280 | ||
281 | A key difference with some other profiling tools is that the main data | |
282 | collection command @CollectApp{} mostly uses | |
283 | @cindex Program Counter sampling | |
284 | @cindex PC sampling | |
285 | Program Counter (PC) sampling | |
286 | under the hood. | |
287 | ||
288 | With @emph{sampling}, the executable is interrupted at regular intervals. Each | |
289 | time it is halted, key information is gathered and stored. This includes the | |
290 | Program Counter that keeps track of where the execution is. Hence the name. | |
291 | ||
292 | Together with operational data, this information is stored in the experiment | |
293 | directory and can be viewed in the second phase. | |
294 | ||
295 | For example, the PC information is used to derive where the program was when | |
296 | it was halted. Since the sampling interval is known, it is relatively easy to | |
297 | derive how much time was spent in the various parts of the program. | |
298 | ||
299 | The opposite technique is generally referred to as @emph{tracing}. With | |
300 | tracing, the target is instrumented with specific calls that collect the | |
301 | requested information. | |
302 | ||
303 | These are some of the pros and cons of PC sampling verus tracing: | |
304 | ||
305 | @itemize | |
306 | ||
307 | @item | |
308 | Since there is no need to recompile, existing executables can be used | |
309 | and the profile measures the behaviour of exactly the same executable that is | |
310 | used in production runs. | |
311 | ||
312 | With sampling, one inherently profiles a different executable, because | |
313 | the calls to the instrumentation library may affect the compiler optimizations | |
314 | and run time behaviour. | |
315 | ||
316 | @item | |
317 | With sampling, there are very few restrictions on what can be profiled and even without | |
318 | access to the source code, a basic profile can be made. | |
319 | ||
320 | @item | |
321 | A downside of sampling is that, depending on the sampling frequency, small | |
322 | functions may be missed or not captured accurately. Although this is rare, | |
323 | this may happen and is the reason why the user has control over the sampling rate. | |
324 | ||
325 | @item | |
326 | While tracing produces precise information, sampling is statistical in nature. | |
327 | As a result, small variations may occur across seemingly identical runs. We | |
328 | have not observed more than a few percent deviation though. Especially if | |
329 | the target job executed for a sufficiently long time. | |
330 | ||
331 | @item | |
332 | With sampling, it is not possible to get an accurate count how often | |
333 | functions are called. | |
334 | ||
335 | @end itemize | |
336 | ||
337 | @c -- A new node -------------------------------------------------------------- | |
338 | @node Steps Needed to Create a Profile | |
339 | @section Steps Needed to Create a Profile | |
340 | @c ---------------------------------------------------------------------------- | |
341 | ||
342 | Creating a profile takes two steps. First the profile data needs to be | |
343 | generated. This is followed by a viewing step to create a report from the | |
344 | information that has been gathered. | |
345 | ||
346 | Every @ProductName{} command starts with @ToolName{}, the name of the driver. | |
347 | This is followed by a keyword to define the high level functionality. Depending | |
348 | on this keyword, a third qualifier may be needed to further narrow down the request. | |
349 | This combination is then followed by options that are specific to the functionality | |
350 | desired. | |
351 | ||
352 | The command to gather, or ``collect'', the performance data is called | |
353 | @CollectApp{}. Aside from numerous options, this command takes the name | |
354 | of the target executable as an input parameter. | |
355 | ||
356 | Upon completion of the run, the performance data can be | |
357 | found in the newly created | |
358 | @cindex Experiment directory | |
359 | experiment directory. | |
360 | ||
361 | Unless explicitly specified otherwise, a default | |
362 | name for this directory is chosen. The name is @file{test.<n>.er} where | |
363 | @var{<n>} is the first integer number not in use yet for such a name. | |
364 | ||
365 | For example, the first time @CollectApp{} is invoked, an experiment | |
366 | directory with the name @file{test.1.er} is created. | |
367 | Upon a subsequent invocation of @CollectApp{} in the same directory, | |
368 | an experiment directory with the name @file{test.2.er} will be created, | |
369 | and so forth. | |
370 | ||
371 | Note that @CollectApp{} supports an option to explicitly name the experiment | |
372 | directory. | |
373 | Aside from the restriction that the name of this directory has to end | |
374 | with @samp{.er}, any valid directory name can be used for this. | |
375 | ||
376 | Now that we have the performance data, the next step is to display it. | |
377 | ||
378 | @IndexSubentry{@code{gprofng}, @code{display text}} | |
379 | The most commonly used command to view the performance information is | |
380 | @DisplayText{}. This is a very extensive and customizable tool that | |
381 | produces the information in ASCII format. | |
382 | ||
383 | @IndexSubentry{@code{gprofng}, @code{display html}} | |
384 | Another option is to use @DisplayHTML{}. This tool generates a directory with | |
385 | files in html format. These can be viewed in a browser, allowing for easy | |
386 | navigation through the profile data. | |
387 | ||
388 | @c -- A new node -------------------------------------------------------------- | |
389 | @node A Mini Tutorial | |
390 | @chapter A Mini Tutorial | |
391 | @c ---------------------------------------------------------------------------- | |
392 | ||
393 | In this chapter we present and discuss the main functionality of @ToolName{}. | |
394 | This will be a practical approach, using an example code to generate profile | |
395 | data and show how to get various performance reports. | |
396 | ||
397 | @menu | |
398 | * Getting Started:: The basics of profiling with @ProductName(). | |
399 | * Support for Multithreading:: Commands specific to multithreaded applications. | |
400 | * View Multiple Experiments:: Analyze multiple experiments simultaneously. | |
401 | * Profile Hardware Event Counters:: How to use hardware event counters. | |
402 | * Java Profiling:: How to profile a Java application. | |
403 | @end menu | |
404 | ||
405 | @c -- A new node -------------------------------------------------------------- | |
406 | @node Getting Started | |
407 | @section Getting Started | |
408 | @c ---------------------------------------------------------------------------- | |
409 | ||
410 | The information presented here provides a good and common basis for many | |
411 | profiling tasks, but there are more features that you may want to leverage. | |
412 | ||
413 | These are covered in subsequent sections in this chapter. | |
414 | ||
415 | @menu | |
416 | * The Example Program:: A description of the example program used. | |
417 | * A First Profile:: How to get the first profile. | |
418 | * The Source Code View:: Display the metrics in the source code. | |
419 | * The Disassembly View:: Display the metrics at the instruction level. | |
420 | * Display and Define the Metrics:: An example how to customize the metrics. | |
421 | * Customization of the Output:: An example how to customize the output. | |
422 | * Name the Experiment Directory:: Change the name of the experiment directory. | |
423 | * Control the Number of Lines in the Output:: Change the number of lines in the tables. | |
424 | * Sorting the Performance Data:: How to set the metric to sort by. | |
425 | * Scripting:: Use a script to execute the commands. | |
426 | * A More Elaborate Example:: An example of customization. | |
427 | * The Call Tree:: Display the dynamic call tree. | |
428 | * More Information on the Experiment:: How to get additional statistics. | |
429 | * Control the Sampling Frequency:: How to control the sampling granularity. | |
430 | * Information on Load Objects:: How to get more information on load objects. | |
431 | @end menu | |
432 | ||
433 | @c -- A new node -------------------------------------------------------------- | |
434 | @node The Example Program | |
435 | @subsection The Example Program | |
436 | @c ---------------------------------------------------------------------------- | |
437 | ||
438 | Throughout this guide we use the same example C code that implements the | |
439 | multiplication of a vector of length @math{n} by an @math{m} by @math{n} | |
440 | matrix. The result is stored in a vector of length @math{m}. | |
441 | @cindex Pthreads | |
442 | @cindex Posix Threads | |
443 | The algorithm has been parallelized using Posix Threads, or Pthreads for short. | |
444 | ||
445 | The code was built using the @code{gcc} compiler and the name of the executable | |
446 | is | |
447 | @cindex mxv-pthreads | |
448 | @command{mxv-pthreads}. | |
449 | ||
450 | The matrix sizes can be set through the @code{-m} and @code{-n} options. The | |
451 | number of threads is set with the @code{-t} option. These are additional threads | |
452 | that are used in the multiplication. To increase the duration of the run, the | |
453 | computations are executed repeatedly. | |
454 | ||
455 | This is an example that multiplies a @math{8000} by @math{4000} matrix with | |
456 | a vector of length @math{4000}. Although this is a multithreaded application, | |
457 | initially we will be using @math{1} threads. Later on we will show examples | |
458 | using multiple threads. | |
459 | ||
460 | @smallexample | |
461 | @verbatim | |
462 | $ ./mxv-pthreads -m 8000 -n 4000 -t 1 | |
463 | mxv: error check passed - rows = 8000 columns = 4000 threads = 1 | |
464 | $ | |
465 | @end verbatim | |
466 | @end smallexample | |
467 | ||
468 | The program performs an internal check to verify that the computed results | |
469 | are correct. The result of this check is printed, as well as the matrix | |
470 | sizes and the number of threads used. | |
471 | ||
472 | @c -- A new node -------------------------------------------------------------- | |
473 | @node A First Profile | |
474 | @subsection A First Profile | |
475 | @c ---------------------------------------------------------------------------- | |
476 | ||
477 | The first step is to collect the performance data. It is important to remember | |
478 | that much more information is gathered than may be shown by default. Often a | |
479 | single data collection run is sufficient to get a lot of insight. | |
480 | ||
481 | The @CollectApp{} command is used for the data collection. Nothing needs to be | |
482 | changed in the way the application is executed. The only difference is that it | |
483 | is now run under control of the tool, as shown below: | |
484 | ||
485 | @cartouche | |
486 | @smallexample | |
487 | $ gprofng collect app ./mxv-pthreads -m 8000 -n 4000 -t 1 | |
488 | @end smallexample | |
489 | @end cartouche | |
490 | ||
491 | @noindent | |
492 | This produces the following output: | |
493 | ||
494 | @smallexample | |
495 | @verbatim | |
496 | Creating experiment directory test.1.er (Process ID: 2749878) ... | |
497 | mxv: error check passed - rows = 8000 columns = 4000 threads = 1 | |
498 | @end verbatim | |
499 | @end smallexample | |
500 | ||
501 | We see a message that an experiment directory with the name @file{test.1.er} | |
502 | has been created. The process id is also echoed. The application completes | |
503 | as usual and we have our first experiment directory that can be analyzed. | |
504 | ||
505 | The tool we use for this is called @DisplayText{}. It takes the name of | |
506 | the experiment directory as an argument. | |
507 | ||
508 | @cindex Interpreter mode | |
509 | If invoked this way, the tool starts in the interactive @emph{interpreter} mode. | |
510 | While in this environment, commands can be given and the tool responds. This is | |
511 | illustrated below: | |
512 | ||
513 | @smallexample | |
514 | @verbatim | |
515 | $ gprofng display text test.1.er | |
516 | Warning: History and command editing is not supported on this system. | |
517 | (gp-display-text) quit | |
518 | $ | |
519 | @end verbatim | |
520 | @end smallexample | |
521 | ||
522 | @cindex Command line mode | |
523 | While useful in certain cases, we prefer to use this tool in command line mode | |
524 | by specifying the commands to be issued when invoking the tool. The way to do | |
525 | this is to prepend the command(s) with a hyphen (@samp{-}) if used on the | |
526 | command line. | |
527 | ||
528 | Since this makes the command appear to be an option, they are also sometimes | |
529 | referred to as such, but technically they are commands. This is the | |
530 | terminology we will use in this user guide, but for convenience the commands | |
531 | are also listed as options in the index. | |
532 | ||
533 | For example, | |
534 | @IndexSubentry{Options, @code{-functions}} | |
535 | @IndexSubentry{Commands, @code{functions}} | |
536 | below we use the @command{functions} command to request a list of the functions | |
537 | that have been executed, plus their respective CPU times: | |
538 | ||
539 | @cartouche | |
540 | @smallexample | |
541 | $ gprofng display text -functions test.1.er | |
542 | @end smallexample | |
543 | @end cartouche | |
544 | ||
545 | @smallexample | |
546 | @verbatim | |
547 | $ gprofng display text -functions test.1.er | |
548 | ||
549 | Functions sorted by metric: Exclusive Total CPU Time | |
550 | ||
551 | Excl. Total Incl. Total Name | |
552 | CPU CPU | |
553 | sec. % sec. % | |
554 | 9.367 100.00 9.367 100.00 <Total> | |
555 | 8.926 95.30 8.926 95.30 mxv_core | |
556 | 0.210 2.24 0.420 4.49 init_data | |
557 | 0.080 0.85 0.210 2.24 drand48 | |
558 | 0.070 0.75 0.130 1.39 erand48_r | |
559 | 0.060 0.64 0.060 0.64 __drand48_iterate | |
560 | 0.010 0.11 0.020 0.21 _int_malloc | |
561 | 0.010 0.11 0.010 0.11 sysmalloc | |
562 | 0. 0. 8.926 95.30 <static>@0x47960 (<libgp-collector.so>) | |
563 | 0. 0. 0.440 4.70 __libc_start_main | |
564 | 0. 0. 0.020 0.21 allocate_data | |
565 | 0. 0. 8.926 95.30 driver_mxv | |
566 | 0. 0. 0.440 4.70 main | |
567 | 0. 0. 0.020 0.21 malloc | |
568 | 0. 0. 8.926 95.30 start_thread | |
569 | @end verbatim | |
570 | @end smallexample | |
571 | ||
572 | As easy and simple as these steps are, we do have a first profile of our program! | |
573 | ||
574 | There are five columns. The first four contain the | |
575 | @cindex Total CPU time | |
576 | ''Total CPU Time'', which | |
577 | is the sum of the user and system time. @xref{Inclusive and Exclusive Metrics} | |
578 | for an explanation of ``exclusive'' and ``inclusive'' times. | |
579 | ||
580 | The first line echoes the metric that is used to sort the output. By default, | |
581 | this is the exclusive CPU time, but through the @command{sort} command, the sort | |
582 | metric can be changed by the user. | |
583 | ||
584 | Next, there are four columns with the exclusive and inclusive CPU times and the | |
585 | respective percentages. This is followed by the name of the function. | |
586 | ||
587 | @IndexSubentry{Miscellaneous, @code{<Total>}} | |
588 | The function with the name @code{<Total>} is not a user function. It is a | |
589 | pseudo function introduced by @ToolName{}. It is used to display the | |
590 | accumulated measured metric values. In this example, we see that the total | |
591 | CPU time of this job was 9.367 seconds and it is scaled to 100%. All | |
592 | other percentages in the same column are relative to this number. | |
593 | ||
594 | @c -- If the metric is derived, for example the @code{IPC}, the value shown under | |
595 | @c -- @code{<Total>} is based upon the total values of the that are metrics used to | |
596 | @c -- compute the derived metric. | |
597 | @c -- @IndexSubentry{Hardware event counters, IPC} | |
598 | ||
599 | With 8.926 seconds, function @code{mxv_core} takes 95.30% of the | |
600 | total time and is by far the most time consuming function. | |
601 | The exclusive and inclusive metrics are identical, which means that is a | |
602 | leaf function not calling any other functions. | |
603 | ||
604 | The next function in the list is @code{init_data}. Although with 4.49%, | |
605 | the CPU time spent in this part is modest, this is an interesting entry because | |
606 | the inclusive CPU time of 0.420 seconds is twice the exclusive CPU time | |
607 | of 0.210 seconds. Clearly this function is calling another function, | |
608 | or even more than one function and collectively this takes 0.210 seconds. | |
609 | Below we show the call tree feature that provides more details on the call | |
610 | structure of the application. | |
611 | ||
612 | The function @code{<static>@@0x47960 (<libgp-collector.so>)} does odd and | |
613 | certainly not familiar. It is one of the internal functions used by | |
614 | @CollectApp{} and can be ignored. Also, while the inclusive time is high, | |
615 | the exclusive time is zero. This means it doesn't contribute to the | |
616 | performance. | |
617 | ||
618 | The question is how we know where this function originates from? There are | |
619 | several commands to dig deeper an get more details on a function. | |
620 | @xref{Information on Load Objects}. | |
621 | ||
622 | @c -- A new node -------------------------------------------------------------- | |
623 | @node The Source Code View | |
624 | @subsection The Source Code View | |
625 | @c ---------------------------------------------------------------------------- | |
626 | ||
627 | In general, the tuning efforts are best focused on the most time consuming | |
628 | part(s) of an application. In this case that is easy, since over 95% of | |
629 | the total CPU time is spent in function @code{mxv_core}. | |
630 | It is now time to dig deeper and look | |
631 | @cindex Source level metrics | |
632 | at the metrics distribution at the source code level. Since we measured | |
633 | CPU times, these are the metrics shown. | |
634 | ||
635 | @IndexSubentry{Options, @code{-source}} | |
636 | @IndexSubentry{Commands, @code{source}} | |
637 | The @code{source} command is used to accomplish this. It takes the name of the | |
638 | function, not the source filename, as an argument. This is demonstrated | |
639 | below, where the @DisplayText{} command is used to show the annotated | |
640 | source listing of function @code{mxv_core}. | |
641 | ||
642 | Be aware that when using the @command{gcc} compiler, the source code has to | |
643 | be compiled with the @code{-g} option in order for the source code feature | |
644 | to work. Otherwise the location(s) can not be determined. For other compilers | |
645 | we recommend to check the documentation for such an option. | |
646 | ||
647 | Below the command to display the source code of a function is shown. Since at | |
648 | this point we are primarily interested in the timings only, we use the | |
649 | @code{metrics} command to request the exclusive and inclusive total CPU | |
650 | timings only. @xref{Display and Define the Metrics} for more information | |
651 | how to define the metrics to be displayed. | |
652 | ||
653 | @cartouche | |
654 | @smallexample | |
655 | $ gprofng display text -metrics ei.totalcpu -source mxv_core test.1.er | |
656 | @end smallexample | |
657 | @end cartouche | |
658 | ||
659 | The output is shown below. It has been somewhat modified to fit the formatting | |
660 | constraints and reduce the number of lines. | |
661 | ||
662 | @smallexample | |
663 | @verbatim | |
664 | Current metrics: e.totalcpu:i.totalcpu:name | |
665 | Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu ) | |
666 | Source file: <apath>/mxv.c | |
667 | Object file: mxv-pthreads (found as test.1.er/archives/...) | |
668 | Load Object: mxv-pthreads (found as test.1.er/archives/...) | |
669 | ||
670 | Excl. Incl. | |
671 | Total Total | |
672 | CPU sec. CPU sec. | |
673 | ||
674 | <lines deleted> | |
675 | <Function: mxv_core> | |
676 | 43. void __attribute__ ((noinline)) | |
677 | mxv_core (int64_t row_index_start, | |
678 | 44. int64_t row_index_end, | |
679 | 45. int64_t m, | |
680 | 46. int64_t n, | |
681 | 47. double **restrict A, | |
682 | 48. double *restrict b, | |
683 | 49. double *restrict c) | |
684 | 50. { | |
685 | 0. 0. 50. { | |
686 | 0. 0. 51. for (int64_t i=row_index_start; | |
687 | i<=row_index_end; i++) | |
688 | 52. { | |
689 | 0. 0. 53. double row_sum = 0.0; | |
690 | ## 4.613 4.613 54. for (int64_t j=0; j<n; j++) | |
691 | ## 4.313 4.313 55. row_sum += A[i][j] * b[j]; | |
692 | 0. 0. 56. c[i] = row_sum; | |
693 | 57. } | |
694 | 0. 0. 58. } | |
695 | @end verbatim | |
696 | @end smallexample | |
697 | ||
698 | The first line echoes the metrics that have been selected. The second line | |
699 | is not very meaningful when looking at the source code listing, but it shows | |
700 | the metric that is used to sort the data. | |
701 | ||
702 | The next three lines provide information on the location of the source file, | |
703 | the object file and the load object (@xref{Load Objects and Functions}). | |
704 | ||
705 | Function @code{mxv_core} is part of a source file that has other functions | |
706 | as well. These functions will be shown with the values for the metrics, but | |
707 | for lay-out purposes they have been removed in the output shown above. | |
708 | ||
709 | The header is followed by the annotated source code listing. The selected | |
710 | metrics are shown first, followed by a source line number, and the source code. | |
711 | @IndexSubentry{Miscellaneous ,@code{##}} | |
712 | The most time consuming line(s) are marked with the @code{##} symbol. In | |
713 | this way they are easier to identify and find with a search. | |
714 | ||
715 | What we see is that all of the time is spent in lines 54-55. | |
716 | ||
717 | @IndexSubentry{Options, @code{-lines}} | |
718 | @IndexSubentry{Commands, @code{lines}} | |
719 | A related command sometimes comes handy as well. It is called @code{lines} | |
720 | and displays a list of the source lines and their metrics, ordered according | |
721 | to the current sort metric (@xref{Sorting the Performance Data}). | |
722 | ||
723 | Below the command and the output. For lay-out reasons, only the top 10 is | |
724 | shown here and the last part of the text on some lines has been replaced | |
725 | by dots. The full text is @samp{instructions without line numbers} and | |
726 | means that the line number information for that function was not found. | |
727 | ||
728 | @cartouche | |
729 | @smallexample | |
730 | $ gprofng display text -lines test.1.er | |
731 | @end smallexample | |
732 | @end cartouche | |
733 | ||
734 | @smallexample | |
735 | @verbatim | |
736 | Lines sorted by metric: Exclusive Total CPU Time | |
737 | ||
738 | Excl. Total Incl. Total Name | |
739 | CPU CPU | |
740 | sec. % sec. % | |
741 | 9.367 100.00 9.367 100.00 <Total> | |
742 | 4.613 49.25 4.613 49.25 mxv_core, line 54 in "mxv.c" | |
743 | 4.313 46.05 4.313 46.05 mxv_core, line 55 in "mxv.c" | |
744 | 0.160 1.71 0.370 3.95 init_data, line 118 in "manage_data.c" | |
745 | 0.080 0.85 0.210 2.24 <Function: drand48, instructions ...> | |
746 | 0.070 0.75 0.130 1.39 <Function: erand48_r, instructions ...> | |
747 | 0.060 0.64 0.060 0.64 <Function: __drand48_iterate, ...> | |
748 | 0.040 0.43 0.040 0.43 init_data, line 124 in "manage_data.c" | |
749 | 0.010 0.11 0.020 0.21 <Function: _int_malloc, instructions ...> | |
750 | 0.010 0.11 0.010 0.11 <Function: sysmalloc, instructions ...> | |
751 | @end verbatim | |
752 | @end smallexample | |
753 | ||
754 | What this overview immediately highlights is that the third most time consuming | |
755 | source line takes 0.370 seconds only. This means that the inclusive time is | |
756 | only 3.95% and clearly this branch of the code hardly impacts the performance. | |
757 | ||
758 | @c -- A new node -------------------------------------------------------------- | |
759 | @node The Disassembly View | |
760 | @subsection The Disassembly View | |
761 | @c ---------------------------------------------------------------------------- | |
762 | ||
763 | The source view is very useful to obtain more insight where the time is spent, | |
764 | but sometimes this is not sufficient. The disassembly view provides more | |
765 | details since it shows the metrics at the instruction level. | |
766 | ||
767 | This view is displayed with the | |
768 | @IndexSubentry{Options, @code{-disasm}} | |
769 | @IndexSubentry{Commands, @code{disasm}} | |
770 | @command{disasm} | |
771 | command and as with the source view, it displays an annotated listing. In this | |
772 | @cindex Instruction level metrics | |
773 | case it shows the instructions with the metrics, interleaved with the | |
774 | source lines. The | |
775 | instructions have a reference in square brackets (@code{[} and @code{]}) | |
776 | to the source line they correspond to. | |
777 | ||
778 | @noindent | |
779 | We again focus on the tmings only and set the metrics accordingly: | |
780 | ||
781 | @cartouche | |
782 | @smallexample | |
783 | $ gprofng display text -metrics ei.totalcpu -disasm mxv_core test.1.er | |
784 | @end smallexample | |
785 | @end cartouche | |
786 | ||
787 | @smallexample | |
788 | @verbatim | |
789 | Current metrics: e.totalcpu:i.totalcpu:name | |
790 | Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu ) | |
791 | Source file: <apath>/src/mxv.c | |
792 | Object file: mxv-pthreads (found as test.1.er/archives/...) | |
793 | Load Object: mxv-pthreads (found as test.1.er/archives/...) | |
794 | ||
795 | Excl. Incl. | |
796 | Total Total | |
797 | CPU sec. CPU sec. | |
798 | ||
799 | <lines deleted> | |
800 | 43. void __attribute__ ((noinline)) | |
801 | mxv_core (int64_t row_index_start, | |
802 | 44. int64_t row_index_end, | |
803 | 45. int64_t m, | |
804 | 46. int64_t n, | |
805 | 47. double **restrict A, | |
806 | 48. double *restrict b, | |
807 | 49. double *restrict c) | |
808 | 50. { | |
809 | <Function: mxv_core> | |
810 | 0. 0. [50] 401d56: mov 0x8(%rsp),%r10 | |
811 | 51. for (int64_t i=row_index_start; | |
812 | i<=row_index_end; i++) | |
813 | 0. 0. [51] 401d5b: cmp %rsi,%rdi | |
814 | 0. 0. [51] 401d5e: jg 0x47 | |
815 | 0. 0. [51] 401d60: add $0x1,%rsi | |
816 | 0. 0. [51] 401d64: jmp 0x36 | |
817 | 52. { | |
818 | 53. double row_sum = 0.0; | |
819 | 54. for (int64_t j=0; j<n; j++) | |
820 | 55 row_sum += A[i][j] * b[j]; | |
821 | 0. 0. [55] 401d66: mov (%r8,%rdi,8),%rdx | |
822 | 0. 0. [54] 401d6a: mov $0x0,%eax | |
823 | 0. 0. [53] 401d6f: pxor %xmm1,%xmm1 | |
824 | 0.110 0.110 [55] 401d73: movsd (%rdx,%rax,8),%xmm0 | |
825 | 1.921 1.921 [55] 401d78: mulsd (%r9,%rax,8),%xmm0 | |
826 | 2.282 2.282 [55] 401d7e: addsd %xmm0,%xmm1 | |
827 | ## 4.613 4.613 [54] 401d82: add $0x1,%rax | |
828 | 0. 0. [54] 401d86: cmp %rax,%rcx | |
829 | 0. 0. [54] 401d89: jne 0xffffffffffffffea | |
830 | 56. c[i] = row_sum; | |
831 | 0. 0. [56] 401d8b: movsd %xmm1,(%r10,%rdi,8) | |
832 | 0. 0. [51] 401d91: add $0x1,%rdi | |
833 | 0. 0. [51] 401d95: cmp %rsi,%rdi | |
834 | 0. 0. [51] 401d98: je 0xd | |
835 | 0. 0. [53] 401d9a: pxor %xmm1,%xmm1 | |
836 | 0. 0. [54] 401d9e: test %rcx,%rcx | |
837 | 0. 0. [54] 401da1: jg 0xffffffffffffffc5 | |
838 | 0. 0. [54] 401da3: jmp 0xffffffffffffffe8 | |
839 | 57. } | |
840 | 58. } | |
841 | 0. 0. [58] 401da5: ret | |
842 | @end verbatim | |
843 | @end smallexample | |
844 | ||
845 | For each instruction, the timing values are given and we can immediately | |
846 | identify the most expensive instructions. As with the source level view, | |
847 | these are marked with the @code{##} symbol. | |
848 | ||
849 | It comes as no surprise that the time consuming instructions originate from | |
850 | the source code at lines 54-55. | |
851 | One thing to note is that the source line numbers no longer appear in | |
852 | sequential order. | |
853 | This is because the compiler has re-ordered the instructions as part of | |
854 | the code optimizations it has performed. | |
855 | ||
856 | As illustrated below and similar to the @command{lines} command, we can get | |
857 | an overview of the instructions executed by using the | |
858 | @IndexSubentry{Options, @code{-pcs}} | |
859 | @IndexSubentry{Commands, @code{pcs}} | |
860 | @command{pcs} | |
861 | command. | |
862 | ||
863 | @noindent | |
864 | Below the command and the output, which again has been restricted | |
865 | to 10 lines. As before, some lines have been shortened for lay-out | |
866 | purposes. | |
867 | ||
868 | @cartouche | |
869 | @smallexample | |
870 | $ gprofng display text -pcs test.1.er | |
871 | @end smallexample | |
872 | @end cartouche | |
873 | ||
874 | @smallexample | |
875 | @verbatim | |
876 | PCs sorted by metric: Exclusive Total CPU Time | |
877 | ||
878 | Excl. Total Incl. Total Name | |
879 | CPU CPU | |
880 | sec. % sec. % | |
881 | 9.367 100.00 9.367 100.00 <Total> | |
882 | 4.613 49.25 4.613 49.25 mxv_core + 0x0000002C, line 54 in "mxv.c" | |
883 | 2.282 24.36 2.282 24.36 mxv_core + 0x00000028, line 55 in "mxv.c" | |
884 | 1.921 20.51 1.921 20.51 mxv_core + 0x00000022, line 55 in "mxv.c" | |
885 | 0.150 1.60 0.150 1.60 init_data + 0x000000AC, line 118 in ... | |
886 | 0.110 1.18 0.110 1.18 mxv_core + 0x0000001D, line 55 in "mxv.c" | |
887 | 0.040 0.43 0.040 0.43 drand48 + 0x00000022 | |
888 | 0.040 0.43 0.040 0.43 init_data + 0x000000F1, line 124 in ... | |
889 | 0.030 0.32 0.030 0.32 __drand48_iterate + 0x0000001E | |
890 | 0.020 0.21 0.020 0.21 __drand48_iterate + 0x00000038 | |
891 | @end verbatim | |
892 | @end smallexample | |
893 | ||
894 | @noindent | |
895 | What we see is that the top three instructions take 94% of the total CPU time | |
896 | and any optimizations should focus on this part of the code.. | |
897 | ||
898 | @c -- A new node -------------------------------------------------------------- | |
899 | @node Display and Define the Metrics | |
900 | @subsection Display and Define the Metrics | |
901 | @c ---------------------------------------------------------------------------- | |
902 | ||
903 | The metrics shown by @DisplayText{} are useful, but there is more recorded | |
904 | than displayed by default. We can customize the values shown by defining the | |
905 | metrics ourselves. | |
906 | ||
907 | There are two commands related to changing the metrics shown: | |
908 | @IndexSubentry{Options, @code{-metric_list}} | |
909 | @IndexSubentry{Commands, @code{metric_list}} | |
910 | @command{metric_list} and | |
911 | @IndexSubentry{Options, @code{-metrics}} | |
912 | @IndexSubentry{Commands, @code{metrics}} | |
913 | @command{metrics}. | |
914 | ||
915 | The first command shows the currently selected metrics, plus all the metrics | |
916 | that have been stored as part of the experiment. The second command may be | |
917 | used to define the metric list. | |
918 | ||
919 | @noindent | |
920 | This is the way to get the information about the metrics: | |
921 | ||
922 | @IndexSubentry{Options, @code{-metric_list}} | |
923 | @IndexSubentry{Commands, @code{metric_list}} | |
924 | @cartouche | |
925 | @smallexample | |
926 | $ gprofng display text -metric_list test.1.er | |
927 | @end smallexample | |
928 | @end cartouche | |
929 | ||
930 | @noindent | |
931 | This is the output: | |
932 | ||
933 | @smallexample | |
934 | @verbatim | |
935 | Current metrics: e.%totalcpu:i.%totalcpu:name | |
936 | Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu ) | |
937 | Available metrics: | |
938 | Exclusive Total CPU Time: e.%totalcpu | |
939 | Inclusive Total CPU Time: i.%totalcpu | |
940 | Size: size | |
941 | PC Address: address | |
942 | Name: name | |
943 | @end verbatim | |
944 | @end smallexample | |
945 | ||
946 | This shows the metrics that are currently used, the metric that is used to sort | |
947 | the data and all the metrics that have been recorded, but are not necessarily | |
948 | shown. | |
949 | ||
950 | @cindex Default metrics | |
951 | In this case, the current metrics are set to the exclusive and inclusive | |
952 | total CPU times, the respective percentages, and the name of the function, | |
953 | or load object. | |
954 | ||
955 | @IndexSubentry{Options, @code{-metrics}} | |
956 | @IndexSubentry{Commands, @code{metrics}} | |
957 | The @code{metrics} command is used to define the metrics that need to be | |
958 | displayed. | |
959 | ||
960 | For example, to swap the exclusive and inclusive metrics, use the following | |
961 | metric definition: @code{i.%totalcpu:e.%totalcpu}. | |
962 | ||
963 | Since the metrics can be tailored for different views, there is also a way | |
964 | to reset them to the default. This is done through the special keyword | |
965 | @code{default} for the metrics definition (@command{-metrics default}). | |
966 | @IndexSubentry{Metrics, Reset to default} | |
967 | ||
968 | @c -- A new node -------------------------------------------------------------- | |
969 | @node Customization of the Output | |
970 | @subsection Customization of the Output | |
971 | @c ---------------------------------------------------------------------------- | |
972 | ||
973 | With the information just given, the function overview can be customized. | |
974 | For sake of the example, we would like to display the name of the function | |
975 | first, only followed by the exclusive CPU time, given as an absolute number | |
976 | and a percentage. | |
977 | ||
978 | Note that the commands are parsed in order of appearance. This is why we | |
979 | need to define the metrics @emph{before} requesting the function overview: | |
980 | ||
981 | @cartouche | |
982 | @smallexample | |
983 | $ gprofng display text -metrics name:e.%totalcpu -functions test.1.er | |
984 | @end smallexample | |
985 | @end cartouche | |
986 | ||
987 | @smallexample | |
988 | @verbatim | |
989 | Current metrics: name:e.%totalcpu | |
990 | Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu ) | |
991 | Functions sorted by metric: Exclusive Total CPU Time | |
992 | ||
993 | Name Excl. Total | |
994 | CPU | |
995 | sec. % | |
996 | <Total> 9.367 100.00 | |
997 | mxv_core 8.926 95.30 | |
998 | init_data 0.210 2.24 | |
999 | drand48 0.080 0.85 | |
1000 | erand48_r 0.070 0.75 | |
1001 | __drand48_iterate 0.060 0.64 | |
1002 | _int_malloc 0.010 0.11 | |
1003 | sysmalloc 0.010 0.11 | |
1004 | <static>@0x47960 (<libgp-collector.so>) 0. 0. | |
1005 | __libc_start_main 0. 0. | |
1006 | allocate_data 0. 0. | |
1007 | driver_mxv 0. 0. | |
1008 | main 0. 0. | |
1009 | malloc 0. 0. | |
1010 | start_thread 0. 0. | |
1011 | @end verbatim | |
1012 | @end smallexample | |
1013 | ||
1014 | This was a first and simple example how to customize the output. Note that we | |
1015 | did not rerun our profiling job and merely modified the display settings. | |
1016 | Below we will show other and also more advanced examples of customization. | |
1017 | ||
1018 | @c -- A new node -------------------------------------------------------------- | |
1019 | @node Name the Experiment Directory | |
1020 | @subsection Name the Experiment Directory | |
1021 | @c ---------------------------------------------------------------------------- | |
1022 | ||
1023 | When using @CollectApp{}, the default names for experiments work fine, but | |
1024 | they are quite generic. It is often more convenient to select a more | |
1025 | descriptive name. For example, one that reflects conditions for the experiment | |
1026 | conducted, like the number of threads used. | |
1027 | ||
1028 | For this, the mutually exclusive @code{-o} and @code{-O} options come in handy. | |
1029 | Both may be used to provide a name for the experiment directory, but the | |
1030 | behaviour of @CollectApp{} is different. | |
1031 | ||
1032 | With the | |
1033 | @IndexSubentry{Options, @code{-o}} | |
1034 | @samp{-o} | |
1035 | option, an existing experiment directory is not overwritten. Any directory | |
1036 | with the same name either needs to be renamed, moved, or removed, before the | |
1037 | experiment can be conducted. | |
1038 | ||
1039 | This is in contrast with the behaviour for the | |
1040 | @IndexSubentry{Options, @code{-O}} | |
1041 | @samp{-O} | |
1042 | option. Any existing directory with the same name is silently overwritten. | |
1043 | ||
1044 | Be aware that the name of the experiment directory has to end with @file{.er}. | |
1045 | ||
1046 | @c -- A new node -------------------------------------------------------------- | |
1047 | @node Control the Number of Lines in the Output | |
1048 | @subsection Control the Number of Lines in the Output | |
1049 | @c ---------------------------------------------------------------------------- | |
1050 | ||
1051 | @IndexSubentry{Options, @code{-limit}} | |
1052 | @IndexSubentry{Commands, @code{limit}} | |
1053 | The @command{limit} @var{<n>} command can be used to control the number of lines | |
1054 | printed in various views. For example it impacts the function view, but also | |
1055 | takes effect for other display commands, like @command{lines}. | |
1056 | ||
1057 | The argument @var{<n>} should be a positive integer number. It sets the number | |
1058 | of lines in the (function) view. A value of zero resets the limit to the | |
1059 | default. | |
1060 | ||
1061 | Be aware that the pseudo-function @code{<Total>} counts as a regular function. | |
1062 | For example @command{limit 10} displays nine user level functions. | |
1063 | ||
1064 | @c -- A new node -------------------------------------------------------------- | |
1065 | @node Sorting the Performance Data | |
1066 | @subsection Sorting the Performance Data | |
1067 | @c ---------------------------------------------------------------------------- | |
1068 | ||
1069 | @IndexSubentry{Options, @code{-sort}} | |
1070 | @IndexSubentry{Commands, @code{sort}} | |
1071 | The @command{sort} @var{<key>} command sets the key to be used when sorting the | |
1072 | performance data. | |
1073 | ||
1074 | The key is a valid metric definition, but the | |
1075 | @IndexSubentry{Metrics, Visibility field} | |
1076 | visibility field | |
1077 | (@xref{Metric Definitions}) | |
1078 | in the metric | |
1079 | definition is ignored, since this does not affect the outcome of the sorting | |
1080 | operation. | |
1081 | For example if the sort key is set to @code{e.totalcpu}, the values | |
1082 | will be sorted in descending order with respect to the exclusive total | |
1083 | CPU time. | |
1084 | ||
1085 | @IndexSubentry{Sort, Reverse order} | |
1086 | The data can be sorted in reverse order by prepending the metric definition | |
1087 | with a minus (@samp{-}) sign. For example @command{sort -e.totalcpu}. | |
1088 | ||
1089 | @IndexSubentry{Sort, Reset to default} | |
1090 | A default metric for the sort operation has been defined and since this is | |
1091 | a persistent command, this default can be restored with @code{default} as | |
1092 | the key (@command{sort default}). | |
1093 | ||
1094 | @c -- A new node -------------------------------------------------------------- | |
1095 | @node Scripting | |
1096 | @subsection Scripting | |
1097 | @c ---------------------------------------------------------------------------- | |
1098 | ||
1099 | @cindex Script files | |
1100 | The list with commands for @DisplayText{} can be very long. This is tedious | |
1101 | and also error prone. Luckily, there is an easier and elegant way to control | |
1102 | the output of this tool. | |
1103 | ||
1104 | @IndexSubentry{Options, @code{-script}} | |
1105 | @IndexSubentry{Commands, @code{script}} | |
1106 | Through the @command{script} command, the name of a file with commands can be | |
1107 | passed in. These commands are parsed and executed as if they appeared on | |
1108 | the command line in the same order as encountered in the file. The commands | |
1109 | in this script file can actually be mixed with commands on the command line | |
1110 | and multiple script files may be used. | |
1111 | The difference between the commands in the script file and those used on the | |
1112 | command line is that the latter require a leading dash (@samp{-}) symbol. | |
1113 | ||
1114 | Comment lines in a script file are supported. They need to start with the | |
1115 | @samp{#} symbol. | |
1116 | ||
1117 | @c -- A new node -------------------------------------------------------------- | |
1118 | @node A More Elaborate Example | |
1119 | @subsection A More Elaborate Example | |
1120 | @c ---------------------------------------------------------------------------- | |
1121 | ||
1122 | With the information presented so far, we can customize our data | |
1123 | gathering and display commands. | |
1124 | ||
1125 | As an example, we would like to use @file{mxv.1.thr.er} as the name for the | |
1126 | experiment directory. In this way, the name of the algorithm and the | |
1127 | number of threads that were used are included in the name. | |
1128 | We also don't mind to overwrite an existing | |
1129 | experiment directory with the same name. | |
1130 | ||
1131 | All that needs to be done is to use the | |
1132 | @IndexSubentry{Options, @code{-O}} | |
1133 | @samp{-O} | |
1134 | option, followed by the directory name of choice when running @CollectApp{}: | |
1135 | ||
1136 | @cartouche | |
1137 | @smallexample | |
1138 | $ exe=mxv-pthreads | |
1139 | $ m=8000 | |
1140 | $ n=4000 | |
1141 | $ gprofng collect app -O mxv.1.thr.er ./$exe -m $m -n $n -t 1 | |
1142 | @end smallexample | |
1143 | @end cartouche | |
1144 | ||
1145 | Since we want to customize the profile and prefer to keep the command line | |
1146 | short, the commands to generate the profile are put into a file with the | |
1147 | name @file{my-script}: | |
1148 | ||
1149 | @smallexample | |
1150 | @verbatim | |
1151 | $ cat my-script | |
1152 | # This is my first gprofng script | |
1153 | # Set the metrics | |
1154 | metrics i.%totalcpu:e.%totalcpu:name | |
1155 | # Use the exclusive time to sort | |
1156 | sort e.totalcpu | |
1157 | # Limit the function list to 5 lines | |
1158 | limit 5 | |
1159 | # Show the function list | |
1160 | functions | |
1161 | @end verbatim | |
1162 | @end smallexample | |
1163 | ||
1164 | This script file is specified as input to the @DisplayText{} command | |
1165 | that is used to display the performance information stored in experiment | |
1166 | directory @file{mxv.1.thr.er}: | |
1167 | ||
1168 | @cartouche | |
1169 | @smallexample | |
1170 | $ gprofng display text -script my-script mxv.1.thr.er | |
1171 | @end smallexample | |
1172 | @end cartouche | |
1173 | ||
1174 | This command produces the following output: | |
1175 | ||
1176 | @smallexample | |
1177 | @verbatim | |
1178 | # This is my first gprofng script | |
1179 | # Set the metrics | |
1180 | Current metrics: i.%totalcpu:e.%totalcpu:name | |
1181 | Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu ) | |
1182 | # Use the exclusive time to sort | |
1183 | Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu ) | |
1184 | # Limit the function list to 5 lines | |
1185 | Print limit set to 5 | |
1186 | # Show the function list | |
1187 | Functions sorted by metric: Exclusive Total CPU Time | |
1188 | ||
1189 | Incl. Total Excl. Total Name | |
1190 | CPU CPU | |
1191 | sec. % sec. % | |
1192 | 9.703 100.00 9.703 100.00 <Total> | |
1193 | 9.226 95.09 9.226 95.09 mxv_core | |
1194 | 0.455 4.69 0.210 2.17 init_data | |
1195 | 0.169 1.75 0.123 1.26 erand48_r | |
1196 | 0.244 2.52 0.075 0.77 drand48 | |
1197 | @end verbatim | |
1198 | @end smallexample | |
1199 | ||
1200 | In the first part of the output the comment lines in the script file are | |
1201 | echoed. These are interleaved with an acknowledgement message for the commands. | |
1202 | ||
1203 | This is followed by a profile consisting of 5 lines only. For both metrics, | |
1204 | the percentages plus the timings are given. The numbers are sorted with respect | |
1205 | to the exclusive total CPU time. Although this is the default, for | |
1206 | demonstration purposes we use the @command{sort} command to explicitly define | |
1207 | the metric for the sort. | |
1208 | ||
1209 | While we executed the same job as before and only changed the name of the | |
1210 | experiment directory, the results are somewhat different. This is sampling | |
1211 | in action. The numbers are not all that different though. | |
1212 | It is seen that function @code{mxv_core} is responsbile for | |
1213 | 95% of the CPU time and @code{init_data} takes 4.5% only. | |
1214 | ||
1215 | @c -- A new node -------------------------------------------------------------- | |
1216 | @node The Call Tree | |
1217 | @subsection The Call Tree | |
1218 | @c ---------------------------------------------------------------------------- | |
1219 | ||
1220 | The call tree shows the dynamic structure of the application by displaying the | |
1221 | functions executed and their parent. The CPU time attributed to each function | |
1222 | is shown as well. This view helps to find the most expensive | |
1223 | execution path in the program. | |
1224 | ||
1225 | @IndexSubentry{Options, @code{-calltree}} | |
1226 | @IndexSubentry{Commands, @code{calltree}} | |
1227 | This feature is enabled through the @command{calltree} command. For example, | |
1228 | this is how to get the call tree for our current experiment: | |
1229 | ||
1230 | @cartouche | |
1231 | @smallexample | |
1232 | $ gprofng display text -calltree mxv.1.thr.er | |
1233 | @end smallexample | |
1234 | @end cartouche | |
1235 | ||
1236 | This displays the following structure: | |
1237 | ||
1238 | @smallexample | |
1239 | @verbatim | |
1240 | Functions Call Tree. Metric: Attributed Total CPU Time | |
1241 | ||
1242 | Attr. Total Name | |
1243 | CPU | |
1244 | sec. % | |
1245 | 9.703 100.00 +-<Total> | |
1246 | 9.226 95.09 +-start_thread | |
1247 | 9.226 95.09 | +-<static>@0x47960 (<libgp-collector.so>) | |
1248 | 9.226 95.09 | +-driver_mxv | |
1249 | 9.226 95.09 | +-mxv_core | |
1250 | 0.477 4.91 +-__libc_start_main | |
1251 | 0.477 4.91 +-main | |
1252 | 0.455 4.69 +-init_data | |
1253 | 0.244 2.52 | +-drand48 | |
1254 | 0.169 1.75 | +-erand48_r | |
1255 | 0.047 0.48 | +-__drand48_iterate | |
1256 | 0.021 0.22 +-allocate_data | |
1257 | 0.021 0.22 | +-malloc | |
1258 | 0.021 0.22 | +-_int_malloc | |
1259 | 0.006 0.06 | +-sysmalloc | |
1260 | 0.003 0.03 | +-__default_morecore | |
1261 | 0.003 0.03 | +-sbrk | |
1262 | 0.003 0.03 | +-brk | |
1263 | 0.001 0.01 +-pthread_create | |
1264 | 0.001 0.01 +-__pthread_create_2_1 | |
1265 | @end verbatim | |
1266 | @end smallexample | |
1267 | ||
1268 | At first sight this may not be what is expected and some explanation is in | |
1269 | place. | |
1270 | ||
1271 | @c ---------------------------------------------------------------------------- | |
1272 | @c TBD: Revise this text when we have user and machine mode. | |
1273 | @c ---------------------------------------------------------------------------- | |
1274 | The top function is the pseudo-function @code{<Total>} that we have seen | |
1275 | before. It is introduced and shown here to provide the total value of the | |
1276 | metric(s). | |
1277 | ||
1278 | We also see function @code{<static>@@0x47960} in the call tree and apparently | |
1279 | it is from @code{libgp-collector.so}, a library that is internal to | |
1280 | @ToolName{}. | |
1281 | The @code{<static>} marker, followed by the program counter, is shown if the | |
1282 | name of the function cannot be found. This function is part of the | |
1283 | implementation of the data collection process and should be hidden to the | |
1284 | user. This is part of a planned future enhancement. | |
1285 | ||
1286 | In general, if a view has a function that does not appear to be part of the | |
1287 | user code, or seems odd anyhow, the @command{objects} and @command{fsingle} | |
1288 | @IndexSubentry{Options, @code{-objects}} | |
1289 | @IndexSubentry{Commands, @code{objects}} | |
1290 | @IndexSubentry{Options, @code{-fsingle}} | |
1291 | @IndexSubentry{Commands, @code{fsingle}} | |
1292 | commands are very useful | |
1293 | to find out more about load objects in general, but also to help identify | |
1294 | an unknown entry in the function overview. @xref{Load Objects and Functions}. | |
1295 | ||
1296 | Another thing to note is that there are two main branches. The one under | |
1297 | @code{<static>@@0x47960} and the second one under @code{__libc_start_main}. | |
1298 | This reflects the fact that this is a multithreaded program and the | |
1299 | threaded part shows up as a separate branch in the call tree. | |
1300 | ||
1301 | The way to interpret this structure is as follows. The program starts | |
1302 | under control of @code{__libc_start_main}. This executes the main program | |
1303 | called @code{main}, which at the top level executes functions | |
1304 | @code{init_data}, @code{allocate_data}, and @code{pthread_create}. | |
1305 | The latter function creates and executes the additional thread(s). | |
1306 | ||
1307 | For this multithreaded part of the code, we need to look at the branch | |
1308 | under function @code{start_thread} that calls the driver code for the | |
1309 | matrix-vector multiplication (@code{driver_mxv}), which executes the function | |
1310 | that performs the actual multiplication (@code{mxv_core}). | |
1311 | ||
1312 | There are two things worth noting for the call tree feature: | |
1313 | ||
1314 | @itemize | |
1315 | ||
1316 | @item | |
1317 | This is a dynamic tree and since sampling is used, it most likely looks | |
1318 | slighlty different across seemingly identical profile runs. In case the | |
1319 | run times are short, it is worth considering to use a high resolution | |
1320 | through the | |
1321 | @IndexSubentry{Options, @code{-p}} | |
1322 | @samp{-p} | |
1323 | option. For example use @samp{-p hi} to increase the sampling rate. | |
1324 | ||
1325 | @item | |
1326 | In case hardware event counters have been enabled | |
1327 | (@xref{Profile Hardware Event Counters}), these values are also displayed | |
1328 | in the call tree view. | |
1329 | ||
1330 | @end itemize | |
1331 | ||
1332 | @c -- A new node -------------------------------------------------------------- | |
1333 | @node More Information on the Experiment | |
1334 | @subsection More Information on the Experiment | |
1335 | @c ---------------------------------------------------------------------------- | |
1336 | ||
1337 | The experiment directory not only contains performance related data. Several | |
1338 | system characteristics, the profiling command executed, plus some global | |
1339 | performance statistics are stored and can be displayed. | |
1340 | ||
1341 | @IndexSubentry{Options, @code{-header}} | |
1342 | @IndexSubentry{Commands, @code{header}} | |
1343 | The @command{header} command displays information about the experiment(s). | |
1344 | For example, this is command is used to extract this data from for our | |
1345 | experiment directory: | |
1346 | ||
1347 | @cartouche | |
1348 | @smallexample | |
1349 | $ gprofng display text -header mxv.1.thr.er | |
1350 | @end smallexample | |
1351 | @end cartouche | |
1352 | ||
1353 | The above command prints the following information. Note that some of the | |
1354 | lay-out and the information has been modified. Directory paths have been | |
1355 | replaced @code{<apath>} for example. Textual changes are | |
1356 | marked with the @samp{<} and @samp{>} symbols. | |
1357 | ||
1358 | @smallexample | |
1359 | @verbatim | |
1360 | Experiment: mxv.1.thr.er | |
1361 | No errors | |
1362 | No warnings | |
1363 | Archive command ` /usr/bin/gp-archive -n -a on --outfile | |
1364 | <apath>/archive.log <apath>/mxv.1.thr.er' | |
1365 | ||
1366 | Target command (64-bit): './mxv-pthreads -m 8000 -n 4000 -t 1' | |
1367 | Process pid 2750071, ppid 2750069, pgrp 2749860, sid 2742080 | |
1368 | Current working directory: <apath> | |
1369 | Collector version: `2.40.00'; experiment version 12.4 (64-bit) | |
1370 | Host `<the-host-name>', OS `Linux <version>', page size 4096, | |
1371 | architecture `x86_64' | |
1372 | 4 CPUs, clock speed 2294 MHz. | |
1373 | Memory: 3506491 pages @ 4096 = 13697 MB. | |
1374 | Data collection parameters: | |
1375 | Clock-profiling, interval = 997 microsecs. | |
1376 | Periodic sampling, 1 secs. | |
1377 | Follow descendant processes from: fork|exec|combo | |
1378 | ||
1379 | Experiment started <date and time> | |
1380 | ||
1381 | Experiment Ended: 9.801216173 | |
1382 | Data Collection Duration: 9.801216173 | |
1383 | @end verbatim | |
1384 | @end smallexample | |
1385 | ||
1386 | The output above may assist in troubleshooting, or to verify some of the | |
1387 | operational conditions and we recommend to include this command when | |
1388 | generating a profile. | |
1389 | ||
1390 | @IndexSubentry{Options, @code{-C}} | |
1391 | Related to this command there is a useful option to record comment(s) in | |
1392 | an experiment. | |
1393 | To this end, use the @samp{-C} option on the @CollectApp{} tool to | |
1394 | specify a comment string. Up to ten comment lines can be included. | |
1395 | These comments are displayed with the @command{header} command on | |
1396 | the @DisplayText{} tool. | |
1397 | ||
1398 | @IndexSubentry{Options, @code{-overview}} | |
1399 | @IndexSubentry{Commands, @code{overview}} | |
1400 | The @command{overview} command displays information on the experiment(s) and | |
1401 | also shows a summary of the values for the metric(s) used. This is an example | |
1402 | how to use it on the newly created experiment directory: | |
1403 | ||
1404 | @cartouche | |
1405 | @smallexample | |
1406 | $ gprofng display text -overview mxv.1.thr.er | |
1407 | @end smallexample | |
1408 | @end cartouche | |
1409 | ||
1410 | @smallexample | |
1411 | @verbatim | |
1412 | Experiment(s): | |
1413 | ||
1414 | Experiment :mxv.1.thr.er | |
1415 | Target : './mxv-pthreads -m 8000 -n 4000 -t 1' | |
1416 | Host : <hostname> (<ISA>, Linux <version>) | |
1417 | Start Time : <date and time> | |
1418 | Duration : 9.801 Seconds | |
1419 | ||
1420 | Metrics: | |
1421 | ||
1422 | Experiment Duration (Seconds): [9.801] | |
1423 | Clock Profiling | |
1424 | [X]Total CPU Time - totalcpu (Seconds): [*9.703] | |
1425 | ||
1426 | Notes: '*' indicates hot metrics, '[X]' indicates currently enabled | |
1427 | metrics. | |
1428 | The metrics command can be used to change selections. The | |
1429 | metric_list command lists all available metrics. | |
1430 | @end verbatim | |
1431 | @end smallexample | |
1432 | ||
1433 | This command provides a dashboard overview that helps to easily identify | |
1434 | where the time is spent and in case hardware event counters are used, it | |
1435 | shows their total values. | |
1436 | ||
1437 | @c -- A new node -------------------------------------------------------------- | |
1438 | @node Control the Sampling Frequency | |
1439 | @subsection Control the Sampling Frequency | |
1440 | @c ---------------------------------------------------------------------------- | |
1441 | ||
1442 | @cindex Sampling frequency | |
1443 | So far we did not go into details on the frequency of the sampling process, | |
1444 | but in some cases it is useful to change the default of 10 milliseconds. | |
1445 | ||
1446 | The advantage of increasing the sampling frequency is that functions that | |
1447 | do not take much time per invocation are more accurately captured. The | |
1448 | downside is that more data is gathered. This has an impact on the overhead | |
1449 | of the collection process and more disk space is required. | |
1450 | ||
1451 | In general this is not an immediate concern, but with heavily threaded | |
1452 | applications that run for an extended period of time, increasing the | |
1453 | frequency may have a more noticeable impact. | |
1454 | ||
1455 | @IndexSubentry{Options, @code{-p}} | |
1456 | The @code{-p} option on the @CollectApp{} tool is used to enable or disable | |
1457 | clock based profiling, or to explicitly set the sampling rate. | |
1458 | @cindex Sampling interval | |
1459 | This option takes one of the following keywords: | |
1460 | ||
1461 | @table @code | |
1462 | ||
1463 | @item off | |
1464 | Disable clock based profiling. | |
1465 | ||
1466 | @item on | |
1467 | Enable clock based profiling with a per thread sampling interval of 10 ms. | |
1468 | This is the default. | |
1469 | ||
1470 | @item lo | |
1471 | Enable clock based profiling with a per thread sampling interval of 100 ms. | |
1472 | ||
1473 | @item hi | |
1474 | Enable clock based profiling with a per thread sampling interval of 1 ms. | |
1475 | ||
1476 | @item @var{value} | |
1477 | @cindex Sampling interval | |
1478 | Enable clock based profiling with a per thread sampling interval of | |
1479 | @var{value}. | |
1480 | ||
1481 | @end table | |
1482 | ||
1483 | It may seem unnecessary to have an option to disable clock based profiling, | |
1484 | but there is a good reason to support this. | |
1485 | By default, clock profiling is enabled when conducting hardware event counter | |
1486 | experiments (@xref{Profile Hardware Event Counters}). | |
1487 | With the @code{-p off} option, this can be disabled. | |
1488 | ||
1489 | If an explicit value is set for the sampling, the number can be an integer or a | |
1490 | floating-point number. | |
1491 | A suffix of @samp{u} for microseconds, or @samp{m} for milliseconds is supported. | |
1492 | If no suffix is used, the value is assumed to be in milliseconds. | |
1493 | ||
1494 | For example, the following command sets the sampling rate to | |
1495 | 5123.4 microseconds: | |
1496 | ||
1497 | @cartouche | |
1498 | @smallexample | |
1499 | $ gprofng collect app -p 5123.4u ./mxv-pthreads -m 8000 -n 4000 -t 1 | |
1500 | @end smallexample | |
1501 | @end cartouche | |
1502 | ||
1503 | If the value is smaller than the clock profiling minimum, a warning message is issued | |
1504 | and it is set to the minimum. | |
1505 | In case it is not a multiple of the clock profiling resolution, it is silently rounded | |
1506 | down to the nearest multiple of the clock resolution. | |
1507 | If the value exceeds the clock profiling maximum, is negative, or zero, an error is | |
1508 | reported. | |
1509 | ||
1510 | @IndexSubentry{Options, @code{-header}} | |
1511 | @IndexSubentry{Commands, @code{header}} | |
1512 | @noindent | |
1513 | Note that the @code{header} command echoes the sampling rate used. | |
1514 | ||
1515 | @c -- A new node -------------------------------------------------------------- | |
1516 | @node Information on Load Objects | |
1517 | @subsection Information on Load Objects | |
1518 | @c ---------------------------------------------------------------------------- | |
1519 | ||
1520 | It may happen that the function view shows a function that is not known to | |
1521 | the user. This can easily happen with library functions for example. | |
1522 | Luckily there are three commands that come in handy then. | |
1523 | ||
1524 | @IndexSubentry{Options, @code{-objects}} | |
1525 | @IndexSubentry{Commands, @code{objects}} | |
1526 | @IndexSubentry{Options, @code{-fsingle}} | |
1527 | @IndexSubentry{Commands, @code{fsingle}} | |
1528 | @IndexSubentry{Options, @code{-fsummary}} | |
1529 | @IndexSubentry{Commands, @code{fsummary}} | |
1530 | These commands are @command{objects}, @command{fsingle}, and @command{fsummary}. | |
1531 | They provide details on | |
1532 | @cindex Load objects | |
1533 | load objects (@xref{Load Objects and Functions}). | |
1534 | ||
1535 | The @command{objects} command lists all load objects that have been referenced | |
1536 | during the performance experiment. | |
1537 | Below we show the command and the result for our profile job. Like before, | |
1538 | some path names in the output have been shortened and replaced by the | |
1539 | @IndexSubentry{Miscellaneous, @code{<apath>}} | |
1540 | @code{<apath>} symbol that represents an absolute directory path. | |
1541 | ||
1542 | @cartouche | |
1543 | @smallexample | |
1544 | $ gprofng display text -objects mxv.1.thr.er | |
1545 | @end smallexample | |
1546 | @end cartouche | |
1547 | ||
1548 | The output includes the name and path of the target executable: | |
1549 | ||
1550 | @smallexample | |
1551 | @verbatim | |
1552 | <Unknown> (<Unknown>) | |
1553 | <mxv-pthreads> (<apath>/mxv-pthreads) | |
1554 | <libdl-2.28.so> (/usr/lib64/libdl-2.28.so) | |
1555 | <librt-2.28.so> (/usr/lib64/librt-2.28.so) | |
1556 | <libc-2.28.so> (/usr/lib64/libc-2.28.so) | |
1557 | <libpthread-2.28.so> (/usr/lib64/libpthread-2.28.so) | |
1558 | <libm-2.28.so> (/usr/lib64/libm-2.28.so) | |
1559 | <libgp-collector.so> (/usr/lib64/gprofng/libgp-collector.so) | |
1560 | <ld-2.28.so> (/usr/lib64/ld-2.28.so) | |
1561 | <DYNAMIC_FUNCTIONS> (DYNAMIC_FUNCTIONS) | |
1562 | @end verbatim | |
1563 | @end smallexample | |
1564 | ||
1565 | @IndexSubentry{Options, @code{-fsingle}} | |
1566 | @IndexSubentry{Commands, @code{fsingle}} | |
1567 | The @command{fsingle} command may be used to get more details on a specific entry | |
1568 | in the function view, say. For example, the command below provides additional | |
1569 | information on the @code{pthread_create} function shown in the function overview. | |
1570 | ||
1571 | @cartouche | |
1572 | @smallexample | |
1573 | $ gprofng display text -fsingle pthread_create mxv.1.thr.er | |
1574 | @end smallexample | |
1575 | @end cartouche | |
1576 | ||
1577 | Below the output from this command. It has been somewhat modified to match the | |
1578 | display requirements. | |
1579 | ||
1580 | @smallexample | |
1581 | @verbatim | |
1582 | + gprofng display text -fsingle pthread_create mxv.1.thr.er | |
1583 | pthread_create | |
1584 | Exclusive Total CPU Time: 0. ( 0. %) | |
1585 | Inclusive Total CPU Time: 0.001 ( 0.0%) | |
1586 | Size: 258 | |
1587 | PC Address: 8:0x00049f60 | |
1588 | Source File: (unknown) | |
1589 | Object File: (unknown) | |
1590 | Load Object: /usr/lib64/gprofng/libgp-collector.so | |
1591 | Mangled Name: | |
1592 | Aliases: | |
1593 | @end verbatim | |
1594 | @end smallexample | |
1595 | ||
1596 | In this table we not only see how much time was spent in this function, we | |
1597 | also see where it originates from. In addition to this, the size and start | |
1598 | address are given as well. If the source code location is known it is also | |
1599 | shown here. | |
1600 | ||
1601 | @IndexSubentry{Options, @code{-fsummary}} | |
1602 | @IndexSubentry{Commands, @code{fsummary}} | |
1603 | The related @code{fsummary} command displays the same information as | |
1604 | @code{fsingle}, but for all functions in the function overview, | |
1605 | including @code{<Total>}: | |
1606 | ||
1607 | @cartouche | |
1608 | @smallexample | |
1609 | $ gprofng display text -fsummary mxv.1.thr.er | |
1610 | @end smallexample | |
1611 | @end cartouche | |
1612 | ||
1613 | @smallexample | |
1614 | @verbatim | |
1615 | Functions sorted by metric: Exclusive Total CPU Time | |
1616 | ||
1617 | <Total> | |
1618 | Exclusive Total CPU Time: 9.703 (100.0%) | |
1619 | Inclusive Total CPU Time: 9.703 (100.0%) | |
1620 | Size: 0 | |
1621 | PC Address: 1:0x00000000 | |
1622 | Source File: (unknown) | |
1623 | Object File: (unknown) | |
1624 | Load Object: <Total> | |
1625 | Mangled Name: | |
1626 | Aliases: | |
1627 | ||
1628 | mxv_core | |
1629 | Exclusive Total CPU Time: 9.226 ( 95.1%) | |
1630 | Inclusive Total CPU Time: 9.226 ( 95.1%) | |
1631 | Size: 80 | |
1632 | PC Address: 2:0x00001d56 | |
1633 | Source File: <apath>/src/mxv.c | |
1634 | Object File: mxv.1.thr.er/archives/mxv-pthreads_ss_pf53V__5 | |
1635 | Load Object: <apath>/mxv-pthreads | |
1636 | Mangled Name: | |
1637 | Aliases: | |
1638 | ||
1639 | ... etc ... | |
1640 | @end verbatim | |
1641 | @end smallexample | |
1642 | ||
1643 | @c -- A new node -------------------------------------------------------------- | |
1644 | @node Support for Multithreading | |
1645 | @section Support for Multithreading | |
1646 | @c ---------------------------------------------------------------------------- | |
1647 | ||
1648 | In this chapter the support for multithreading is introduced and discussed. | |
1649 | As is shown below, nothing needs to be changed when collecting the performance | |
1650 | data. | |
1651 | ||
1652 | The difference is that additional commands are available to get more | |
1653 | information on the multithreading details, plus that several filters allow | |
1654 | the user to zoom in on specific threads. | |
1655 | ||
1656 | @c -- A new node -------------------------------------------------------------- | |
1657 | @node Creating a Multithreading Experiment | |
1658 | @subsection Creating a Multithreading Experiment | |
1659 | @c ---------------------------------------------------------------------------- | |
1660 | ||
1661 | We demonstrate the support for multithreading using the same code and settings | |
1662 | as before, but this time 2 threads are used: | |
1663 | ||
1664 | @cartouche | |
1665 | @smallexample | |
1666 | $ exe=mxv-pthreads | |
1667 | $ m=8000 | |
1668 | $ n=4000 | |
1669 | $ gprofng collect app -O mxv.2.thr.er ./$exe -m $m -n $n -t 2 | |
1670 | @end smallexample | |
1671 | @end cartouche | |
1672 | ||
1673 | First of all, in as far as @ProductName{} is concerned, no changes are needed. | |
1674 | Nothing special is needed to profile a multithreaded job when using @ToolName{}. | |
1675 | ||
1676 | The same is true when displaying the performance results. The same commands | |
1677 | that were used before work unmodified. For example, this is all that is needed to | |
1678 | get a function overview: | |
1679 | ||
1680 | @cartouche | |
1681 | @smallexample | |
1682 | $ gprofng display text -limit 5 -functions mxv.2.thr.er | |
1683 | @end smallexample | |
1684 | @end cartouche | |
1685 | ||
1686 | This produces the following familiar looking output: | |
1687 | ||
1688 | @smallexample | |
1689 | @verbatim | |
1690 | Print limit set to 5 | |
1691 | Functions sorted by metric: Exclusive Total CPU Time | |
1692 | ||
1693 | Excl. Total Incl. Total Name | |
1694 | CPU CPU | |
1695 | sec. % sec. % | |
1696 | 9.464 100.00 9.464 100.00 <Total> | |
1697 | 8.961 94.69 8.961 94.69 mxv_core | |
1698 | 0.224 2.37 0.469 4.95 init_data | |
1699 | 0.105 1.11 0.177 1.88 erand48_r | |
1700 | 0.073 0.77 0.073 0.77 __drand48_iterate | |
1701 | @end verbatim | |
1702 | @end smallexample | |
1703 | ||
1704 | @c -- A new node -------------------------------------------------------------- | |
1705 | @node Commands Specific to Multithreading | |
1706 | @subsection Commands Specific to Multithreading | |
1707 | @c ---------------------------------------------------------------------------- | |
1708 | ||
1709 | The function overview shown above shows the results aggregated over all the | |
1710 | threads. The interesting new element is that we can also look at the | |
1711 | performance data for the individual threads. | |
1712 | ||
1713 | @IndexSubentry{Options, @code{-thread_list}} | |
1714 | @IndexSubentry{Commands, @code{thread_list}} | |
1715 | The @command{thread_list} command displays how many threads have been used: | |
1716 | ||
1717 | @cartouche | |
1718 | @smallexample | |
1719 | $ gprofng display text -thread_list mxv.2.thr.er | |
1720 | @end smallexample | |
1721 | @end cartouche | |
1722 | ||
1723 | This produces the following output, showing that three threads have | |
1724 | been used: | |
1725 | ||
1726 | @smallexample | |
1727 | @verbatim | |
1728 | Exp Sel Total | |
1729 | === === ===== | |
1730 | 1 all 3 | |
1731 | @end verbatim | |
1732 | @end smallexample | |
1733 | ||
1734 | The output confirms there is one experiment and that by default all | |
1735 | threads are selected. | |
1736 | ||
1737 | It may seem surprising to see three threads here, since we used the | |
1738 | @code{-t 2} option, but it is common for a Pthreads program to use one | |
1739 | additional thread. | |
1740 | Typically, there is one main thread that runs from start to finish. | |
1741 | It handles the sequential portions of the code, as well as thread | |
1742 | management related tasks. | |
1743 | It is no different in the example code. At some point, the main thread | |
1744 | creates and activates the two threads that perform the multiplication | |
1745 | of the matrix with the vector. Upon completion of this computation, | |
1746 | the main thread continues. | |
1747 | ||
1748 | @IndexSubentry{Options, @code{-threads}} | |
1749 | @IndexSubentry{Commands, @code{threads}} | |
1750 | The @command{threads} command is simple, yet very powerful. It shows the | |
1751 | total value of the metrics for each thread. | |
1752 | ||
1753 | @cartouche | |
1754 | @smallexample | |
1755 | $ gprofng display text -threads mxv.2.thr.er | |
1756 | @end smallexample | |
1757 | @end cartouche | |
1758 | ||
1759 | @noindent | |
1760 | The command above produces the following overview: | |
1761 | ||
1762 | @smallexample | |
1763 | @verbatim | |
1764 | Objects sorted by metric: Exclusive Total CPU Time | |
1765 | ||
1766 | Excl. Total Name | |
1767 | CPU | |
1768 | sec. % | |
1769 | 9.464 100.00 <Total> | |
1770 | 4.547 48.05 Process 1, Thread 3 | |
1771 | 4.414 46.64 Process 1, Thread 2 | |
1772 | 0.502 5.31 Process 1, Thread 1 | |
1773 | @end verbatim | |
1774 | @end smallexample | |
1775 | ||
1776 | The first line gives the total CPU time accumulated over the threads | |
1777 | selected. This is followed by the metric value(s) for each thread. | |
1778 | ||
1779 | From this it is clear that the main thread is responsible for a | |
1780 | little over 5% of the total CPU time, while the other two threads | |
1781 | take 47-48% each. | |
1782 | ||
1783 | This view is ideally suited to verify if there are any load balancing | |
1784 | issues and also to find the most time consuming thread(s). | |
1785 | ||
1786 | @IndexSubentry{Filters, Thread selection} | |
1787 | While useful, often more information than this is needed. This is | |
1788 | @IndexSubentry{Options, @code{-thread_select}} | |
1789 | @IndexSubentry{Commands, @code{thread_select}} | |
1790 | where the thread selection filter comes in. Through the | |
1791 | @command{thread_select} | |
1792 | command, one or more threads may be selected. | |
1793 | @xref{The Selection List} how to define the selection list. | |
1794 | ||
1795 | Since it is most common to use this command in a script, we do so as | |
1796 | well here. Below the script we are using: | |
1797 | ||
1798 | @cartouche | |
1799 | @smallexample | |
1800 | # Define the metrics | |
1801 | metrics e.%totalcpu | |
1802 | # Limit the output to 5 lines | |
1803 | limit 5 | |
1804 | # Get the function overview for thread 1 | |
1805 | thread_select 1 | |
1806 | functions | |
1807 | # Get the function overview for thread 2 | |
1808 | thread_select 2 | |
1809 | functions | |
1810 | # Get the function overview for thread 3 | |
1811 | thread_select 3 | |
1812 | functions | |
1813 | @end smallexample | |
1814 | @end cartouche | |
1815 | ||
1816 | The definition of the metrics and the output limit have been shown and | |
1817 | explained earlier. The new command to focus on is @command{thread_select}. | |
1818 | ||
1819 | This command takes a list (@xref{The Selection List}) to select specific | |
1820 | threads. In this case, the individual thread numbers that were | |
1821 | obtained earlier with the @command{thread_list} command are selected. | |
1822 | ||
1823 | This restricts the output of the @command{functions} command to the thread | |
1824 | number(s) specified. This means that the script above shows which | |
1825 | function(s) each thread executes and how much CPU time they consumed. | |
1826 | Both the exclusive timings and their percentages are given. | |
1827 | ||
1828 | Note that technically this command is a filter and persistent. The | |
1829 | selection remains active until changed through another thread selection | |
1830 | command, or when it is reset with the @samp{all} selection list. | |
1831 | ||
1832 | @noindent | |
1833 | This is the relevant part of the output for the first thread: | |
1834 | ||
1835 | @smallexample | |
1836 | @verbatim | |
1837 | Exp Sel Total | |
1838 | === === ===== | |
1839 | 1 1 3 | |
1840 | Functions sorted by metric: Exclusive Total CPU Time | |
1841 | ||
1842 | Excl. Total Name | |
1843 | CPU | |
1844 | sec. % | |
1845 | 0.502 100.00 <Total> | |
1846 | 0.224 44.64 init_data | |
1847 | 0.105 20.83 erand48_r | |
1848 | 0.073 14.48 __drand48_iterate | |
1849 | 0.067 13.29 drand48 | |
1850 | @end verbatim | |
1851 | @end smallexample | |
1852 | ||
1853 | As usual, the comment lines are echoed. This is followed by a confirmation | |
1854 | of the selection. The first table shows that one experiment is loaded and | |
1855 | that thread 1 out of the three threads has been selected. What is | |
1856 | displayed next is the function overview for this particular thread. Due to | |
1857 | the @code{limit 5} command, there are only five functions in this list. | |
1858 | ||
1859 | Clearly, this thread handles the data initialization part and as we know | |
1860 | from the call tree output, function @code{init_data} executes the 3 other | |
1861 | functions shown in this profile. | |
1862 | ||
1863 | Below are the overviews for threads 2 and 3 respectively. It is seen that all | |
1864 | of the CPU time is spent in function @code{mxv_core} and that this time | |
1865 | is approximately the same for both threads. | |
1866 | ||
1867 | @smallexample | |
1868 | @verbatim | |
1869 | # Get the function overview for thread 2 | |
1870 | Exp Sel Total | |
1871 | === === ===== | |
1872 | 1 2 3 | |
1873 | Functions sorted by metric: Exclusive Total CPU Time | |
1874 | ||
1875 | Excl. Total Name | |
1876 | CPU | |
1877 | sec. % | |
1878 | 4.414 100.00 <Total> | |
1879 | 4.414 100.00 mxv_core | |
1880 | 0. 0. <static>@0x48630 (<libgp-collector.so>) | |
1881 | 0. 0. driver_mxv | |
1882 | 0. 0. start_thread | |
1883 | ||
1884 | # Get the function overview for thread 3 | |
1885 | Exp Sel Total | |
1886 | === === ===== | |
1887 | 1 3 3 | |
1888 | Functions sorted by metric: Exclusive Total CPU Time | |
1889 | ||
1890 | Excl. Total Name | |
1891 | CPU | |
1892 | sec. % | |
1893 | 4.547 100.00 <Total> | |
1894 | 4.547 100.00 mxv_core | |
1895 | 0. 0. <static>@0x48630 (<libgp-collector.so>) | |
1896 | 0. 0. driver_mxv | |
1897 | 0. 0. start_thread | |
1898 | @end verbatim | |
1899 | @end smallexample | |
1900 | ||
1901 | When analyzing the performance of a multithreaded application, it is sometimes | |
1902 | useful to know whether threads have mostly executed on the same core, say, or | |
1903 | if they have wandered across multiple cores. This sort of stickiness is usually | |
1904 | referred to as | |
1905 | @cindex Thread affinity | |
1906 | @emph{thread affinity}. | |
1907 | ||
1908 | Similar to the commands for the threads, there are several commands related | |
1909 | to the usage of the cores, or @emph{CPUs} as they are called in @ToolName{} | |
46c56757 | 1910 | (@xref{The Concept of a CPU in gprofng}). |
7a515757 VM |
1911 | |
1912 | @IndexSubentry{Options, @code{-cpu_list}} | |
1913 | @IndexSubentry{Commands, @code{cpu_list}} | |
1914 | Similar to the @command{thread_list} command, the @command{cpu_list} command | |
1915 | displays how many CPUs have been used. | |
1916 | @IndexSubentry{Options, @code{-cpus}} | |
1917 | @IndexSubentry{Commands, @code{cpus}} | |
1918 | The equivalent of the @command{threads} threads command, is the @command{cpus} | |
1919 | command, which shows the numbers of the CPUs that were used and the metric values | |
1920 | for each one of them. Both commands are demonstrated below. | |
1921 | ||
1922 | @cartouche | |
1923 | @smallexample | |
1924 | $ gprofng display text -cpu_list -cpus mxv.2.thr.er | |
1925 | @end smallexample | |
1926 | @end cartouche | |
1927 | ||
1928 | @noindent | |
1929 | This command produces the following output: | |
1930 | ||
1931 | @smallexample | |
1932 | @verbatim | |
1933 | + gprofng display text -cpu_list -cpus mxv.2.thr.er | |
1934 | Exp Sel Total | |
1935 | === === ===== | |
1936 | 1 all 4 | |
1937 | Objects sorted by metric: Exclusive Total CPU Time | |
1938 | ||
1939 | Excl. Total Name | |
1940 | CPU | |
1941 | sec. % | |
1942 | 9.464 100.00 <Total> | |
1943 | 4.414 46.64 CPU 2 | |
1944 | 2.696 28.49 CPU 0 | |
1945 | 1.851 19.56 CPU 1 | |
1946 | 0.502 5.31 CPU 3 | |
1947 | @end verbatim | |
1948 | @end smallexample | |
1949 | ||
1950 | The first table shows that there is only one experiment and that all of the | |
1951 | four CPUs have been selected. The second table shows the exclusive metrics | |
1952 | for each of the CPUs that have been used. | |
1953 | ||
1954 | As also echoed in the output, the data is sorted with respect to the | |
1955 | exclusive CPU time, but it is very easy to sort the data by the CPU id | |
1956 | @IndexSubentry{Options, -sort} | |
1957 | @IndexSubentry{Commands, sort} | |
1958 | by using the @command{sort} command: | |
1959 | ||
1960 | @cartouche | |
1961 | @smallexample | |
1962 | $ gprofng display text -cpu_list -sort name -cpus mxv.2.thr.er | |
1963 | @end smallexample | |
1964 | @end cartouche | |
1965 | ||
1966 | @noindent | |
1967 | With the @command{sort} added, the output is as follows: | |
1968 | ||
1969 | @smallexample | |
1970 | @verbatim | |
1971 | Exp Sel Total | |
1972 | === === ===== | |
1973 | 1 all 4 | |
1974 | Current Sort Metric: Name ( name ) | |
1975 | Objects sorted by metric: Name | |
1976 | ||
1977 | Excl. Total Name | |
1978 | CPU | |
1979 | sec. % | |
1980 | 9.464 100.00 <Total> | |
1981 | 2.696 28.49 CPU 0 | |
1982 | 1.851 19.56 CPU 1 | |
1983 | 4.414 46.64 CPU 2 | |
1984 | 0.502 5.31 CPU 3 | |
1985 | @end verbatim | |
1986 | @end smallexample | |
1987 | ||
1988 | While the table with thread times shown earlier may point at a load imbalance | |
1989 | in the application, this overview has a different purpose. | |
1990 | ||
1991 | For example, we see that 4 CPUs have been used, but we know that the | |
1992 | application uses 3 threads only. | |
1993 | We will now demonstrate how filters can be used to help answer the | |
1994 | question why 4 CPUs are used, while the application has 3 threads only. | |
1995 | This means that at least one thread has executed on more than one CPU. | |
1996 | ||
1997 | Recall the thread level timings: | |
1998 | ||
1999 | @smallexample | |
2000 | @verbatim | |
2001 | Excl. Total Name | |
2002 | CPU | |
2003 | sec. % | |
2004 | 9.464 100.00 <Total> | |
2005 | 4.547 48.05 Process 1, Thread 3 | |
2006 | 4.414 46.64 Process 1, Thread 2 | |
2007 | 0.502 5.31 Process 1, Thread 1 | |
2008 | @end verbatim | |
2009 | @end smallexample | |
2010 | ||
2011 | Compared to the CPU timings above, it seems very likely that thread 3 has | |
2012 | used more than one CPU, because the thread and CPU timings are the same | |
2013 | for both other threads. | |
2014 | ||
2015 | The command below selects thread number 3 and then requests the CPU | |
2016 | utilization for this thread: | |
2017 | ||
2018 | @cartouche | |
2019 | @smallexample | |
2020 | $ gprofng display text -thread_select 3 -sort name -cpus mxv.2.thr.er | |
2021 | @end smallexample | |
2022 | @end cartouche | |
2023 | ||
2024 | The output shown below confirms that thread 3 is selected and then displays | |
2025 | the CPU(s) that have been used by this thread: | |
2026 | ||
2027 | @smallexample | |
2028 | @verbatim | |
2029 | Exp Sel Total | |
2030 | === === ===== | |
2031 | 1 3 3 | |
2032 | ||
2033 | Objects sorted by metric: Exclusive Total CPU Time | |
2034 | ||
2035 | Excl. Total Name | |
2036 | CPU | |
2037 | sec. % | |
2038 | 4.547 100.00 <Total> | |
2039 | 2.696 59.29 CPU 0 | |
2040 | 1.851 40.71 CPU 1 | |
2041 | @end verbatim | |
2042 | @end smallexample | |
2043 | ||
2044 | The results show that this thread has used CPU 0 nearly 60% of the time | |
2045 | and CPU 1 for the remaining 40%. | |
2046 | ||
2047 | To confirm that this is the only thread that has used more than one CPU, the | |
2048 | same approach can be used for threads 1 and 2: | |
2049 | ||
2050 | @smallexample | |
2051 | @verbatim | |
2052 | $ gprofng display text -thread_select 1 -cpus mxv.2.thr.er | |
2053 | Exp Sel Total | |
2054 | === === ===== | |
2055 | 1 1 3 | |
2056 | Objects sorted by metric: Exclusive Total CPU Time | |
2057 | ||
2058 | Excl. Total Name | |
2059 | CPU | |
2060 | sec. % | |
2061 | 0.502 100.00 <Total> | |
2062 | 0.502 100.00 CPU 3 | |
2063 | @end verbatim | |
2064 | @end smallexample | |
2065 | ||
2066 | @smallexample | |
2067 | @verbatim | |
2068 | $ gprofng display text -thread_select 2 -cpus mxv.2.thr.er | |
2069 | Exp Sel Total | |
2070 | === === ===== | |
2071 | 1 2 3 | |
2072 | Objects sorted by metric: Exclusive Total CPU Time | |
2073 | ||
2074 | Excl. Total Name | |
2075 | CPU | |
2076 | sec. % | |
2077 | 4.414 100.00 <Total> | |
2078 | 4.414 100.00 CPU 2 | |
2079 | @end verbatim | |
2080 | @end smallexample | |
2081 | ||
2082 | @noindent | |
2083 | The output above shows that indeed threads 1 and 2 each have used a single | |
2084 | CPU only. | |
2085 | ||
2086 | @c -- A new node -------------------------------------------------------------- | |
2087 | @node View Multiple Experiments | |
2088 | @section View Multiple Experiments | |
2089 | @c ---------------------------------------------------------------------------- | |
2090 | ||
2091 | One thing we did not cover sofar is that @ToolName{} fully supports the analysis | |
2092 | of multiple experiments. The @DisplayText{} tool accepts a list of experiments. | |
2093 | The data can either be aggregated across the experiments, or used in a | |
2094 | comparison. | |
2095 | ||
2096 | The default is to aggregate the metric values across the experiments that have | |
2097 | been loaded. The @command{compare} command can be used to enable the | |
2098 | @IndexSubentry{Options, @code{-compare}} | |
2099 | @IndexSubentry{Commands, @code{compare}} | |
2100 | comparison of results. | |
2101 | ||
2102 | In this section both modes are illustrated with an example. | |
2103 | ||
2104 | @c -- A new node -------------------------------------------------------------- | |
2105 | @node Aggregation of Experiments | |
2106 | @subsection Aggregation of Experiments | |
2107 | @c ---------------------------------------------------------------------------- | |
2108 | ||
2109 | If the data for multiple experiments is aggregrated, the @DisplayText{} tool | |
2110 | shows the combined results. | |
2111 | For example, below is the script to show the function view for the data | |
2112 | aggregated over two experiments, drop the first experiment and then show | |
2113 | the function view fo the second experiment only. | |
2114 | We will call it @file{my-script-agg}. | |
2115 | ||
2116 | @cartouche | |
2117 | @smallexample | |
2118 | # Define the metrics | |
2119 | metrics e.%totalcpu | |
2120 | # Limit the output to 5 lines | |
2121 | limit 5 | |
2122 | # Get the list with experiments | |
2123 | experiment_list | |
2124 | # Get the function overview for all | |
2125 | functions | |
2126 | # Drop the first experiment | |
2127 | drop_exp mxv.2.thr.er | |
2128 | # Get the function overview for exp #2 | |
2129 | functions | |
2130 | @end smallexample | |
2131 | @end cartouche | |
2132 | ||
2133 | @IndexSubentry{Options, @code{-experiment_list}} | |
2134 | @IndexSubentry{Commands, @code{experiment_list}} | |
2135 | With the exception of the @command{experiment_list} command, all commands | |
2136 | used have been discussed earlier. | |
2137 | ||
2138 | The @command{experiment_list} command provides a list of the experiments | |
2139 | that have been loaded. This may be used to get the experiment IDs and | |
2140 | to verify the correct experiments are loaded for the aggregation. | |
2141 | ||
2142 | @noindent | |
2143 | Below is an example that loads two experiments and uses the above | |
2144 | script to display different function views. | |
2145 | ||
2146 | @cartouche | |
2147 | @smallexample | |
2148 | $ gprofng display text -script my-script-agg mxv.2.thr.er mxv.4.thr.er | |
2149 | @end smallexample | |
2150 | @end cartouche | |
2151 | ||
2152 | @noindent | |
2153 | This produces the following output: | |
2154 | ||
2155 | @smallexample | |
2156 | @verbatim | |
2157 | # Define the metrics | |
2158 | Current metrics: e.%totalcpu:name | |
2159 | Current Sort Metric: Exclusive Total CPU Time ( e.%totalcpu ) | |
2160 | # Limit the output to 5 lines | |
2161 | Print limit set to 5 | |
2162 | # Get the list with experiments | |
2163 | ID Sel PID Experiment | |
2164 | == === ======= ============ | |
2165 | 1 yes 1339450 mxv.2.thr.er | |
2166 | 2 yes 3579561 mxv.4.thr.er | |
2167 | # Get the function overview for all | |
2168 | Functions sorted by metric: Exclusive Total CPU Time | |
2169 | ||
2170 | Excl. Total Name | |
2171 | CPU | |
2172 | sec. % | |
2173 | 20.567 100.00 <Total> | |
2174 | 19.553 95.07 mxv_core | |
2175 | 0.474 2.30 init_data | |
2176 | 0.198 0.96 erand48_r | |
2177 | 0.149 0.72 drand48 | |
2178 | ||
2179 | # Drop the first experiment | |
2180 | Experiment mxv.2.thr.er has been dropped | |
2181 | # Get the function overview for exp #2 | |
2182 | Functions sorted by metric: Exclusive Total CPU Time | |
2183 | ||
2184 | Excl. Total Name | |
2185 | CPU | |
2186 | sec. % | |
2187 | 11.104 100.00 <Total> | |
2188 | 10.592 95.39 mxv_core | |
2189 | 0.249 2.24 init_data | |
2190 | 0.094 0.84 erand48_r | |
2191 | 0.082 0.74 drand48 | |
2192 | @end verbatim | |
2193 | @end smallexample | |
2194 | ||
2195 | The first five lines should look familiar. The five lines following echo | |
2196 | the comment line in the script and show the overview of the experiments. | |
2197 | This confirms two experiments have been loaded and that both are active. | |
2198 | This is followed by the function overview. The timings have been summed | |
2199 | up and the percentages are adjusted accordingly. | |
2200 | ||
2201 | @c -- A new node -------------------------------------------------------------- | |
2202 | @node Comparison of Experiments | |
2203 | @subsection Comparison of Experiments | |
2204 | @c ---------------------------------------------------------------------------- | |
2205 | ||
2206 | The support for multiple experiments really shines in comparison mode. | |
2207 | @cindex Compare experiments | |
2208 | In comparison mode, the data for the various experiments is shown side by | |
2209 | side, as illustrated below where we compare the results for the multithreaded | |
2210 | experiments using two and four threads respectively. | |
2211 | ||
2212 | This | |
2213 | feature is controlled through the | |
2214 | @IndexSubentry{Options, @code{-compare}} | |
2215 | @IndexSubentry{Commands, @code{compare}} | |
2216 | @code{compare} command. | |
2217 | ||
2218 | The comparison mode is enabled through @command{compare on} and with | |
2219 | @command{compare off} it is disabled again. | |
2220 | In addition to @samp{on}, or @samp{off}, this command also supports | |
2221 | the @samp{delta} and @samp{ratio} keywords. | |
2222 | ||
2223 | This is the script that will be used in our example. It sets the comparison | |
2224 | mode to @samp{on}: | |
2225 | ||
2226 | @smallexample | |
2227 | @verbatim | |
2228 | # Define the metrics | |
2229 | metrics e.%totalcpu | |
2230 | # Limit the output to 5 lines | |
2231 | limit 5 | |
2232 | # Set the comparison mode to differences | |
2233 | compare on | |
2234 | # Get the function overview | |
2235 | functions | |
2236 | @end verbatim | |
2237 | @end smallexample | |
2238 | ||
2239 | Assuming this script file is called @file{my-script-comp}, this is how | |
2240 | it is used to display the differences: | |
2241 | ||
2242 | @cartouche | |
2243 | @smallexample | |
2244 | $ gprofng display text -script my-script-comp mxv.2.thr.er mxv.4.thr.er | |
2245 | @end smallexample | |
2246 | @end cartouche | |
2247 | ||
2248 | @noindent | |
2249 | This produces the output shown below. The data for the first experiment | |
2250 | is shown as absolute numbers. The timings for the other experiment are | |
2251 | shown as a delta relative to these reference numbers: | |
2252 | ||
2253 | @smallexample | |
2254 | @verbatim | |
2255 | ||
2256 | mxv.2.thr.er mxv.4.thr.er | |
2257 | Excl. Total Excl. Total Name | |
2258 | CPU CPU | |
2259 | sec. % sec. % | |
2260 | 9.464 100.00 11.104 100.00 <Total> | |
2261 | 8.961 94.69 10.592 95.39 mxv_core | |
2262 | 0.224 2.37 0.249 2.24 init_data | |
2263 | 0.105 1.11 0.094 0.84 erand48_r | |
2264 | 0.073 0.77 0.060 0.54 __drand48_iterate | |
2265 | @end verbatim | |
2266 | @end smallexample | |
2267 | ||
2268 | This table is already helpful to more easily compare (two) profiles, but | |
2269 | there is more that we can do here. | |
2270 | ||
2271 | By default, in comparison mode, all measured values are shown. Often | |
2272 | profiling is about comparing performance data. It is therefore | |
2273 | sometimes more useful to look at differences or ratios, using one | |
2274 | experiment as a reference. | |
2275 | ||
2276 | The values shown are relative to this difference. For example if a ratio | |
2277 | is below one, it means the reference value was higher. | |
2278 | ||
2279 | In the example below, we use the same two experiments used in the comparison | |
2280 | above. The script is also nearly identical. The only change is that we now | |
2281 | use the @samp{delta} keyword. | |
2282 | ||
2283 | As before, the number of lines is restricted to 5 and we focus on | |
2284 | the exclusive timings plus percentages. For the comparison part we are | |
2285 | interested in the differences. | |
2286 | ||
2287 | This is the script that produces such an overview: | |
2288 | ||
2289 | @smallexample | |
2290 | @verbatim | |
2291 | # Define the metrics | |
2292 | metrics e.%totalcpu | |
2293 | # Limit the output to 5 lines | |
2294 | limit 5 | |
2295 | # Set the comparison mode to differences | |
2296 | compare delta | |
2297 | # Get the function overview | |
2298 | functions | |
2299 | @end verbatim | |
2300 | @end smallexample | |
2301 | ||
2302 | Assuming this script file is called @file{my-script-comp2}, this is how we | |
2303 | get the table displayed on our screen: | |
2304 | ||
2305 | @cartouche | |
2306 | @smallexample | |
2307 | $ gprofng display text -script my-script-comp2 mxv.2.thr.er mxv.4.thr.er | |
2308 | @end smallexample | |
2309 | @end cartouche | |
2310 | ||
2311 | Leaving out some of the lines printed, but we have seen before, we get | |
2312 | the following table: | |
2313 | ||
2314 | @smallexample | |
2315 | @verbatim | |
2316 | mxv.2.thr.er mxv.4.thr.er | |
2317 | Excl. Total Excl. Total Name | |
2318 | CPU CPU | |
2319 | sec. % delta % | |
2320 | 9.464 100.00 +1.640 100.00 <Total> | |
2321 | 8.961 94.69 +1.631 95.39 mxv_core | |
2322 | 0.224 2.37 +0.025 2.24 init_data | |
2323 | 0.105 1.11 -0.011 0.84 erand48_r | |
2324 | 0.073 0.77 -0.013 0.54 __drand48_iterate | |
2325 | @end verbatim | |
2326 | @end smallexample | |
2327 | ||
2328 | It is now easier to see that the CPU times for the most time consuming | |
2329 | functions in this code are practically the same. | |
2330 | ||
2331 | It is also possible to show ratio's through the @command{compare ratio} | |
2332 | @IndexSubentry{Options, @code{-compare}} | |
2333 | @IndexSubentry{Commands, @code{compare}} | |
2334 | command. The first colum is used as a reference and the values for | |
2335 | the other columns with metrics are derived by dividing the value by | |
2336 | the reference. The result for such a comparison is shown below: | |
2337 | ||
2338 | @smallexample | |
2339 | @verbatim | |
2340 | mxv.2.thr.er mxv.4.thr.er | |
2341 | Excl. Total Excl. Total CPU Name | |
2342 | CPU | |
2343 | sec. % ratio % | |
2344 | 9.464 100.00 x 1.173 100.00 <Total> | |
2345 | 8.961 94.69 x 1.182 95.39 mxv_core | |
2346 | 0.224 2.37 x 1.111 2.24 init_data | |
2347 | 0.105 1.11 x 0.895 0.84 erand48_r | |
2348 | 0.073 0.77 x 0.822 0.54 __drand48_iterate | |
2349 | @end verbatim | |
2350 | @end smallexample | |
2351 | ||
2352 | Note that the comparison feature is supported at the function, source, and | |
2353 | disassembly level. There is no practical limit on the number of experiments | |
2354 | that can be used in a comparison. | |
2355 | ||
2356 | @c -- A new node -------------------------------------------------------------- | |
2357 | @node Profile Hardware Event Counters | |
2358 | @section Profile Hardware Event Counters | |
2359 | @c ---------------------------------------------------------------------------- | |
2360 | ||
2361 | Many processors provide a set of hardware event counters and @ToolName{} | |
2362 | provides support for this feature. | |
2363 | @xref{Hardware Event Counters Explained} for those readers that are not | |
2364 | familiar with such counters and like to learn more. | |
2365 | ||
2366 | In this section we explain how to get the details on the event counter | |
2367 | support for the processor used in the experiment(s), and show several | |
2368 | examples. | |
2369 | ||
2370 | @c -- A new node -------------------------------------------------------------- | |
2371 | @node Getting Information on the Counters Supported | |
2372 | @subsection Getting Information on the Counters Supported | |
2373 | @c ---------------------------------------------------------------------------- | |
2374 | ||
2375 | The first step is to check if the processor used for the experiments is | |
2376 | supported by @ToolName{}. | |
2377 | @IndexSubentry{Options, @code{-h}} | |
2378 | The @code{-h} option on @CollectApp{} will show the event counter | |
2379 | information: | |
2380 | ||
2381 | @cartouche | |
2382 | @smallexample | |
2383 | $ gprofng collect app -h | |
2384 | @end smallexample | |
2385 | @end cartouche | |
2386 | ||
2387 | In case the counters are supported, a list with the events is printed. | |
2388 | Otherwise, a warning message will be issued. | |
2389 | ||
2390 | For example, below we show this command and the output on an Intel Xeon | |
2391 | Platinum 8167M (aka ``Skylake'') processor. The output has been split | |
2392 | into several sections and each section is commented upon separately. | |
2393 | ||
2394 | @smallexample | |
2395 | @verbatim | |
2396 | Run "gprofng collect app --help" for a usage message. | |
2397 | ||
2398 | Specifying HW counters on `Intel Arch PerfMon v2 on Family 6 Model 85' | |
2399 | (cpuver=2499): | |
2400 | ||
2401 | -h {auto|lo|on|hi} | |
2402 | turn on default set of HW counters at the specified rate | |
2403 | -h <ctr_def> [-h <ctr_def>]... | |
2404 | -h <ctr_def>[,<ctr_def>]... | |
2405 | specify HW counter profiling for up to 4 HW counters | |
2406 | @end verbatim | |
2407 | @end smallexample | |
2408 | ||
2409 | The first line shows how to get a usage overview. This is followed by | |
2410 | some information on the target processor. | |
2411 | The next five lines explain in what ways the @code{-h} option can be | |
2412 | used to define the events to be monitored. | |
2413 | ||
2414 | The first version shown above enables a default set of counters. This | |
2415 | default depends on the processor this command is executed on. The | |
2416 | keyword following the @code{-h} option defines the sampling rate: | |
2417 | ||
2418 | @table @code | |
2419 | ||
2420 | @item auto | |
2421 | Match the sample rate of used by clock profiling. If the latter is disabled, | |
2422 | Use a per thread sampling rate of approximately 100 samples per second. | |
2423 | This setting is the default and preferred. | |
2424 | ||
2425 | @item on | |
2426 | Use a per thread sampling rate of approximately 100 samples per second. | |
2427 | ||
2428 | @item lo | |
2429 | Use a per thread sampling rate of approximately 10 samples per second. | |
2430 | ||
2431 | @item hi | |
2432 | Use a per thread sampling rate of approximately 1000 samples per second. | |
2433 | ||
2434 | @end table | |
2435 | ||
2436 | The second and third variant define the events to be monitored. Note | |
2437 | that the number of simultaneous events supported is printed. In this | |
2438 | case we can monitor four events in a single profiling job. | |
2439 | ||
2440 | It is a matter of preference whether you like to use the @code{-h} | |
2441 | option for each event, or use it once, followed by a comma separated | |
2442 | list. | |
2443 | ||
2444 | There is one slight catch though. The counter definition below has | |
2445 | mandatory comma (@code{,}) between the event and the rate. While a | |
2446 | default can be used for the rate, the comma cannot be omitted. | |
2447 | This may result in a somewhat awkward counter definition in case | |
2448 | the default sampling rate is used. | |
2449 | ||
2450 | For example, the following two commands are equivalent. Note | |
2451 | the double comma in the second command. This is not a typo. | |
2452 | ||
2453 | @cartouche | |
2454 | @smallexample | |
2455 | $ gprofng collect app -h cycles -h insts ... | |
2456 | $ gprofng collect app -h cycles,,insts ... | |
2457 | @end smallexample | |
2458 | @end cartouche | |
2459 | ||
2460 | In the first command this comma is not needed, because a | |
2461 | comma (``@code{,}'') immediately followed by white space may | |
2462 | be omitted. | |
2463 | ||
2464 | This is why we prefer the this syntax and in the remainder will | |
2465 | use the first version of this command. | |
2466 | ||
2467 | @IndexSubentry{Hardware event counters, counter definition} | |
2468 | The counter definition takes an event name, plus optionally one or | |
2469 | more attributes, followed by a comma, and optionally the sampling rate. | |
2470 | The output section below shows the formal definition. | |
2471 | ||
2472 | @cartouche | |
2473 | @smallexample | |
2474 | <ctr_def> == <ctr>[[~<attr>=<val>]...],[<rate>] | |
2475 | @end smallexample | |
2476 | @end cartouche | |
2477 | ||
2478 | The printed help then explains this syntax. Below we have summarized | |
2479 | and expanded this output: | |
2480 | ||
2481 | @table @code | |
2482 | ||
2483 | @item @var{<ctr>} | |
2484 | The counter name must be selected from the available counters listed | |
2485 | as part of the output printed with the @code{-h} option. | |
2486 | On most systems, if a counter is not listed, it may still be specified | |
2487 | by its numeric value. | |
2488 | ||
2489 | @item @var{~<attr>=<val>} | |
2490 | This is an optional attribute that depends on the processor. The list | |
2491 | of supported attributes is printed in the output. Examples of | |
2492 | attributes are ``user'', or ``system''. The value can given in decimal | |
2493 | or hexadecimal format. | |
2494 | Multiple attributes may be specified, and each must be preceded | |
2495 | by a ~. | |
2496 | ||
2497 | @item @var{<rate>} | |
2498 | ||
2499 | The sampling rate is one of the following: | |
2500 | ||
2501 | @table @code | |
2502 | ||
2503 | @item auto | |
2504 | This is the default and matches the rate used by clock profiling. | |
2505 | If clock profiling is disabled, use @samp{on}. | |
2506 | ||
2507 | @item on | |
2508 | Set the per thread maximum sampling rate to ~100 samples/second | |
2509 | ||
2510 | @item lo | |
2511 | Set the per thread maximum sampling rate to ~10 samples/second | |
2512 | ||
2513 | @item hi | |
2514 | Set the per thread maximum sampling rate to ~1000 samples/second | |
2515 | ||
2516 | @item @var{<interval>} | |
2517 | Define the sampling interval. | |
2518 | @xref{Control the Sampling Frequency} how to define this. | |
2519 | ||
2520 | @end table | |
2521 | ||
2522 | @end table | |
2523 | ||
2524 | After the section with the formal definition of events and counters, a | |
2525 | processor specific list is displayed. This part starts with an overview | |
2526 | of the default set of counters and the aliased names supported | |
2527 | @emph{on this specific processor}. | |
2528 | ||
2529 | @smallexample | |
2530 | @verbatim | |
2531 | Default set of HW counters: | |
2532 | ||
2533 | -h cycles,,insts,,llm | |
2534 | ||
2535 | Aliases for most useful HW counters: | |
2536 | ||
2537 | alias raw name type units regs description | |
2538 | ||
2539 | cycles unhalted-core-cycles CPU-cycles 0123 CPU Cycles | |
2540 | insts instruction-retired events 0123 Instructions Executed | |
2541 | llm llc-misses events 0123 Last-Level Cache Misses | |
2542 | br_msp branch-misses-retired events 0123 Branch Mispredict | |
2543 | br_ins branch-instruction-retired events 0123 Branch Instructions | |
2544 | @end verbatim | |
2545 | @end smallexample | |
2546 | ||
2547 | @noindent | |
2548 | The definitions given above may or may not be available on other processors. | |
2549 | ||
2550 | The table above shows the default set of counters defined for this processor, | |
2551 | and the aliases. For each alias the full ``raw'' name is given, plus the | |
2552 | unit of the number returned by the counter (CPU cycles, or a raw count), | |
2553 | the hardware counter the event is allowed to be mapped onto, and a short | |
2554 | description. | |
2555 | ||
2556 | The last part of the output contains all the events that can be monitored: | |
2557 | ||
2558 | @smallexample | |
2559 | @verbatim | |
2560 | Raw HW counters: | |
2561 | ||
2562 | name type units regs description | |
2563 | ||
2564 | unhalted-core-cycles CPU-cycles 0123 | |
2565 | unhalted-reference-cycles events 0123 | |
2566 | instruction-retired events 0123 | |
2567 | llc-reference events 0123 | |
2568 | llc-misses events 0123 | |
2569 | branch-instruction-retired events 0123 | |
2570 | branch-misses-retired events 0123 | |
2571 | ld_blocks.store_forward events 0123 | |
2572 | ld_blocks.no_sr events 0123 | |
2573 | ld_blocks_partial.address_alias events 0123 | |
2574 | dtlb_load_misses.miss_causes_a_walk events 0123 | |
2575 | dtlb_load_misses.walk_completed_4k events 0123 | |
2576 | ||
2577 | <many lines deleted> | |
2578 | ||
2579 | l2_lines_out.silent events 0123 | |
2580 | l2_lines_out.non_silent events 0123 | |
2581 | l2_lines_out.useless_hwpf events 0123 | |
2582 | sq_misc.split_lock events 0123 | |
2583 | @end verbatim | |
2584 | @end smallexample | |
2585 | ||
2586 | As can be seen, these names are not always easy to correlate to a specific | |
2587 | event of interest. The processor manual should provide more clarity on this. | |
2588 | ||
2589 | @c -- A new node -------------------------------------------------------------- | |
2590 | @node Examples Using Hardware Event Counters | |
2591 | @subsection Examples Using Hardware Event Counters | |
2592 | @c ---------------------------------------------------------------------------- | |
2593 | ||
2594 | The previous section may give the impression that these counters are hard to | |
2595 | use, but as we will show now, in practice it is quite simple. | |
2596 | ||
2597 | With the information from the @code{-h} option, we can easily set up our first | |
2598 | event counter experiment. | |
2599 | ||
2600 | We start by using the default set of counters defined for our processor and we | |
2601 | use 2 threads: | |
2602 | ||
2603 | @cartouche | |
2604 | @smallexample | |
2605 | $ exe=mxv-pthreads | |
2606 | $ m=8000 | |
2607 | $ n=4000 | |
2608 | $ exp=mxv.hwc.def.2.thr.er | |
2609 | $ gprofng collect app -O $exp -h auto ./$exe -m $m -n $n -t 2 | |
2610 | @end smallexample | |
2611 | @end cartouche | |
2612 | ||
2613 | @IndexSubentry{Options, @code{-h}} | |
2614 | @IndexSubentry{Hardware event counters, @code{auto} option} | |
2615 | The new option here is @code{-h auto}. The @code{auto} keyword enables | |
2616 | hardware event counter profiling and selects the default set of counters | |
2617 | defined for this processor. | |
2618 | ||
2619 | As before, we can display the information, but there is one practical hurdle | |
2620 | to take. Unless we like to view all metrics recorded, we would need to know | |
2621 | the names of the events that have been enabled. This is tedious and also not | |
2622 | portable in case we would like to repeat this experiment on another processor. | |
2623 | ||
2624 | @IndexSubentry{Hardware event counters, @code{hwc} metric} | |
2625 | This is where the special @code{hwc} metric comes very handy. It | |
46c56757 VM |
2626 | automatically expands to the active set of hardware event counters used |
2627 | in the experiment(s). | |
7a515757 VM |
2628 | |
2629 | With this, it is very easy to display the event counter values. Note that | |
2630 | although the regular clock based profiling was enabled, we only want to see | |
2631 | the counter values. We also request to see the percentages and limit the | |
2632 | output to the first 5 lines: | |
2633 | ||
2634 | @cartouche | |
2635 | @smallexample | |
2636 | $ exp=mxv.hwc.def.2.thr.er | |
2637 | $ gprofng display text -metrics e.%hwc -limit 5 -functions $exp | |
2638 | @end smallexample | |
2639 | @end cartouche | |
2640 | ||
2641 | @smallexample | |
2642 | @verbatim | |
2643 | Current metrics: e.%cycles:e+%insts:e+%llm:name | |
2644 | Current Sort Metric: Exclusive CPU Cycles ( e.%cycles ) | |
2645 | Print limit set to 5 | |
2646 | Functions sorted by metric: Exclusive CPU Cycles | |
2647 | ||
2648 | Excl. CPU Excl. Instructions Excl. Last-Level Name | |
2649 | Cycles Executed Cache Misses | |
2650 | sec. % % % | |
2651 | 2.691 100.00 7906475309 100.00 122658983 100.00 <Total> | |
2652 | 2.598 96.54 7432724378 94.01 121745696 99.26 mxv_core | |
2653 | 0.035 1.31 188860269 2.39 70084 0.06 erand48_r | |
2654 | 0.026 0.95 73623396 0.93 763116 0.62 init_data | |
2655 | 0.018 0.66 76824434 0.97 40040 0.03 drand48 | |
2656 | @end verbatim | |
2657 | @end smallexample | |
2658 | ||
2659 | As we have seen before, the first few lines echo the settings. | |
2660 | This includes a list with the hardware event counters used by | |
2661 | default. | |
2662 | ||
2663 | The table that follows makes it very easy to get an overview where the | |
2664 | time is spent and how many of the target events have occurred. | |
2665 | ||
2666 | As before, we can drill down deeper and see the same metrics at the source | |
2667 | line and instruction level. Other than using @code{hwc} in the metrics | |
2668 | definitions, nothing has changed compared to the previous examples: | |
2669 | ||
2670 | @cartouche | |
2671 | @smallexample | |
2672 | $ exp=mxv.hwc.def.2.thr.er | |
2673 | $ gprofng display text -metrics e.hwc -source mxv_core $exp | |
2674 | @end smallexample | |
2675 | @end cartouche | |
2676 | ||
2677 | This is the relevant part of the output. Since the lines get very long, | |
2678 | we have somewhat modified the lay-out: | |
2679 | ||
2680 | @smallexample | |
2681 | @verbatim | |
2682 | Excl. CPU Excl. Excl. | |
2683 | Cycles Instructions Last-Level | |
2684 | sec. Executed Cache Misses | |
2685 | <Function: mxv_core> | |
2686 | 0. 0 0 32. void __attribute__ ((noinline)) | |
2687 | mxv_core(...) | |
2688 | 0. 0 0 33. { | |
2689 | 0. 0 0 34. for (uint64_t i=...) { | |
2690 | 0. 0 0 35. double row_sum = 0.0; | |
2691 | ## 1.872 7291879319 88150571 36. for (int64_t j=0; j<n; j++) | |
2692 | 0.725 140845059 33595125 37. row_sum += A[i][j]*b[j]; | |
2693 | 0. 0 0 38. c[i] = row_sum; | |
2694 | 39. } | |
2695 | 0. 0 0 40. } | |
2696 | @end verbatim | |
2697 | @end smallexample | |
2698 | ||
2699 | In a smiliar way we can display the event counter values at the instruction | |
2700 | level. Again we have modified the lay-out due to page width limitations: | |
2701 | ||
2702 | @cartouche | |
2703 | @smallexample | |
2704 | $ exp=mxv.hwc.def.2.thr.er | |
2705 | $ gprofng display text -metrics e.hwc -disasm mxv_core $exp | |
2706 | @end smallexample | |
2707 | @end cartouche | |
2708 | ||
2709 | @smallexample | |
2710 | @verbatim | |
2711 | Excl. CPU Excl. Excl. | |
2712 | Cycles Instructions Last-Level | |
2713 | sec. Executed Cache Misses | |
2714 | <Function: mxv_core> | |
2715 | 0. 0 0 [33] 4021ba: mov 0x8(%rsp),%r10 | |
2716 | 34. for (uint64_t i=...) { | |
2717 | 0. 0 0 [34] 4021bf: cmp %rsi,%rdi | |
2718 | 0. 0 0 [34] 4021c2: jbe 0x37 | |
2719 | 0. 0 0 [34] 4021c4: ret | |
2720 | 35. double row_sum = 0.0; | |
2721 | 36. for (int64_t j=0; j<n; j++) | |
2722 | 37. row_sum += A[i][j]*b[j]; | |
2723 | 0. 0 0 [37] 4021c5: mov (%r8,%rdi,8),%rdx | |
2724 | 0. 0 0 [36] 4021c9: mov $0x0,%eax | |
2725 | 0. 0 0 [35] 4021ce: pxor %xmm1,%xmm1 | |
2726 | 0.002 12804230 321394 [37] 4021d2: movsd (%rdx,%rax,8),%xmm0 | |
2727 | 0.141 60819025 3866677 [37] 4021d7: mulsd (%r9,%rax,8),%xmm0 | |
2728 | 0.582 67221804 29407054 [37] 4021dd: addsd %xmm0,%xmm1 | |
2729 | ## 1.871 7279075109 87989870 [36] 4021e1: add $0x1,%rax | |
2730 | 0.002 12804210 80351 [36] 4021e5: cmp %rax,%rcx | |
2731 | 0. 0 0 [36] 4021e8: jne 0xffffffffffffffea | |
2732 | 38. c[i] = row_sum; | |
2733 | 0. 0 0 [38] 4021ea: movsd %xmm1,(%r10,%rdi,8) | |
2734 | 0. 0 0 [34] 4021f0: add $0x1,%rdi | |
2735 | 0. 0 0 [34] 4021f4: cmp %rdi,%rsi | |
2736 | 0. 0 0 [34] 4021f7: jb 0xd | |
2737 | 0. 0 0 [35] 4021f9: pxor %xmm1,%xmm1 | |
2738 | 0. 0 0 [36] 4021fd: test %rcx,%rcx | |
2739 | 0. 0 80350 [36] 402200: jne 0xffffffffffffffc5 | |
2740 | 0. 0 0 [36] 402202: jmp 0xffffffffffffffe8 | |
2741 | 39. } | |
2742 | 40. } | |
2743 | 0. 0 0 [40] 402204: ret | |
2744 | @end verbatim | |
2745 | @end smallexample | |
2746 | ||
2747 | So far we have used the default settings for the event counters. It is | |
2748 | quite straightforward to select specific counters. For sake of the | |
2749 | example, let's assume we would like to count how many branch instructions | |
2750 | and retired memory load instructions that missed in the L1 cache have been | |
2751 | executed. We also want to count these events with a high resolution. | |
2752 | ||
2753 | This is the command to do so: | |
2754 | ||
2755 | @cartouche | |
2756 | @smallexample | |
2757 | $ exe=mxv-pthreads | |
2758 | $ m=8000 | |
2759 | $ n=4000 | |
2760 | $ exp=mxv.hwc.sel.2.thr.er | |
2761 | $ hwc1=br_ins,hi | |
2762 | $ hwc2=mem_load_retired.l1_miss,hi | |
2763 | $ gprofng collect app -O $exp -h $hwc1 -h $hwc2 $exe -m $m -n $n -t 2 | |
2764 | @end smallexample | |
2765 | @end cartouche | |
2766 | ||
2767 | As before, we get a table with the event counts. Due to the very | |
2768 | long name for the second counter, we have somewhat modified the | |
2769 | output. | |
2770 | ||
2771 | @cartouche | |
2772 | @smallexample | |
2773 | $ gprofng display text -limit 10 -functions mxv.hwc.sel.2.thr.er | |
2774 | @end smallexample | |
2775 | @end cartouche | |
2776 | ||
2777 | @smallexample | |
2778 | @verbatim | |
2779 | Functions sorted by metric: Exclusive Total CPU Time | |
2780 | Excl. Incl. Excl. Branch Excl. Name | |
2781 | Total Total Instructions mem_load_retired.l1_miss | |
2782 | CPU sec. CPU sec. Events | |
2783 | 2.597 2.597 1305305319 4021340 <Total> | |
2784 | 2.481 2.481 1233233242 3982327 mxv_core | |
2785 | 0.040 0.107 19019012 9003 init_data | |
2786 | 0.028 0.052 23023048 15006 erand48_r | |
2787 | 0.024 0.024 19019008 9004 __drand48_iterate | |
2788 | 0.015 0.067 11011009 2998 drand48 | |
2789 | 0.008 0.010 0 3002 _int_malloc | |
2790 | 0.001 0.001 0 0 brk | |
2791 | 0.001 0.002 0 0 sysmalloc | |
2792 | 0. 0.001 0 0 __default_morecore | |
2793 | @end verbatim | |
2794 | @end smallexample | |
2795 | ||
2796 | @IndexSubentry{Options, @code{-compare}} | |
2797 | @IndexSubentry{Commands, @code{compare}} | |
2798 | When using event counters, the values could be very large and it is not easy | |
2799 | to compare the numbers. As we will show next, the @code{ratio} feature is | |
2800 | very useful when comparing such profiles. | |
2801 | ||
2802 | To demonstrate this, we have set up another event counter experiment where | |
2803 | we would like to compare the number of last level cache miss and the number | |
2804 | of branch instructions executed when using a single thread, or two threads. | |
2805 | ||
2806 | These are the commands used to generate the experiment directories: | |
2807 | ||
2808 | @cartouche | |
2809 | @smallexample | |
2810 | $ exe=./mxv-pthreads | |
2811 | $ m=8000 | |
2812 | $ n=4000 | |
2813 | $ exp1=mxv.hwc.comp.1.thr.er | |
2814 | $ exp2=mxv.hwc.comp.2.thr.er | |
2815 | $ gprofng collect app -O $exp1 -h llm -h br_ins $exe -m $m -n $n -t 1 | |
2816 | $ gprofng collect app -O $exp2 -h llm -h br_ins $exe -m $m -n $n -t 2 | |
2817 | @end smallexample | |
2818 | @end cartouche | |
2819 | ||
2820 | The following script has been used to get the tables. Due to lay-out | |
2821 | restrictions, we have to create two tables, one for each counter. | |
2822 | ||
2823 | @cartouche | |
2824 | @smallexample | |
2825 | # Limit the output to 5 lines | |
2826 | limit 5 | |
2827 | # Define the metrics | |
2828 | metrics name:e.llm | |
2829 | # Set the comparison to ratio | |
2830 | compare ratio | |
2831 | functions | |
2832 | # Define the metrics | |
2833 | metrics name:e.br_ins | |
2834 | # Set the comparison to ratio | |
2835 | compare ratio | |
2836 | functions | |
2837 | @end smallexample | |
2838 | @end cartouche | |
2839 | ||
2840 | Note that we print the name of the function first, followed by the counter | |
2841 | data. | |
2842 | The new element is that we set the comparison mode to @code{ratio}. This | |
2843 | divides the data in a column by its counterpart in the reference experiment. | |
2844 | ||
2845 | This is the command using this script and the two experiment directories as | |
2846 | input: | |
2847 | ||
2848 | @cartouche | |
2849 | @smallexample | |
2850 | $ gprofng display text -script my-script-comp-counters \ | |
2851 | mxv.hwc.comp.1.thr.er \ | |
2852 | mxv.hwc.comp.2.thr.er | |
2853 | @end smallexample | |
2854 | @end cartouche | |
2855 | ||
2856 | By design, we get two tables, one for each counter: | |
2857 | ||
2858 | @smallexample | |
2859 | @verbatim | |
2860 | Functions sorted by metric: Exclusive Last-Level Cache Misses | |
2861 | ||
2862 | mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er | |
2863 | Name Excl. Last-Level Excl. Last-Level | |
2864 | Cache Misses Cache Misses | |
2865 | ratio | |
2866 | <Total> 122709276 x 0.788 | |
2867 | mxv_core 121796001 x 0.787 | |
2868 | init_data 723064 x 1.055 | |
2869 | erand48_r 100111 x 0.500 | |
2870 | drand48 60065 x 1.167 | |
2871 | ||
2872 | Functions sorted by metric: Exclusive Branch Instructions | |
2873 | ||
2874 | mxv.hwc.comp.1.thr.er mxv.hwc.comp.2.thr.er | |
2875 | Name Excl. Branch Excl. Branch | |
2876 | Instructions Instructions | |
2877 | ratio | |
2878 | <Total> 1307307316 x 0.997 | |
2879 | mxv_core 1235235239 x 0.997 | |
2880 | erand48_r 23023033 x 0.957 | |
2881 | drand48 20020009 x 0.600 | |
2882 | __drand48_iterate 17017028 x 0.882 | |
2883 | @end verbatim | |
2884 | @end smallexample | |
2885 | ||
2886 | A ratio less than one in the second column, means that this counter | |
2887 | value was smaller than the value from the reference experiment shown | |
2888 | in the first column. | |
2889 | ||
2890 | This kind of presentation of the results makes it much easier to | |
2891 | quickly interpret the data. | |
2892 | ||
2893 | We conclude this section with thread-level event counter overviews, | |
2894 | but before we go into this, there is an important metric we need to | |
2895 | mention. | |
2896 | ||
2897 | @c -- TBD Explain <Total> for IPC | |
2898 | ||
2899 | @IndexSubentry{Hardware event counters, IPC} | |
2900 | In case it is known how many instructions and CPU cycles have been executed, | |
2901 | the value for the IPC (``Instructions Per Clockycle'') can be computed. | |
2902 | @xref{Hardware Event Counters Explained}. | |
2903 | This is a derived metric that gives an indication how well the processor | |
2904 | is utilized. The inverse of the IPC is called CPI. | |
2905 | @IndexSubentry{Hardware event counters, CPI} | |
2906 | ||
2907 | The @DisplayText{} command automatically computes the IPC and CPI values | |
2908 | if an experiment contains the event counter values for the instructions | |
2909 | and CPU cycles executed. These are part of the metric list and can be | |
2910 | displayed, just like any other metric. | |
2911 | ||
2912 | @IndexSubentry{Options, @code{-metric_list}} | |
2913 | @IndexSubentry{Commands, @code{metric_list}} | |
2914 | This can be verified through the @command{metric_list} command. If we go | |
2915 | back to our earlier experiment with the default event counters, we get | |
2916 | the following result. | |
2917 | ||
2918 | @cartouche | |
2919 | @smallexample | |
2920 | $ gprofng display text -metric_list mxv.hwc.def.2.thr.er | |
2921 | @end smallexample | |
2922 | @end cartouche | |
2923 | ||
2924 | @smallexample | |
2925 | @verbatim | |
2926 | Current metrics: e.totalcpu:i.totalcpu:e.cycles:e+insts:e+llm:name | |
2927 | Current Sort Metric: Exclusive Total CPU Time ( e.totalcpu ) | |
2928 | Available metrics: | |
2929 | Exclusive Total CPU Time: e.%totalcpu | |
2930 | Inclusive Total CPU Time: i.%totalcpu | |
2931 | Exclusive CPU Cycles: e.+%cycles | |
2932 | Inclusive CPU Cycles: i.+%cycles | |
2933 | Exclusive Instructions Executed: e+%insts | |
2934 | Inclusive Instructions Executed: i+%insts | |
2935 | Exclusive Last-Level Cache Misses: e+%llm | |
2936 | Inclusive Last-Level Cache Misses: i+%llm | |
2937 | Exclusive Instructions Per Cycle: e+IPC | |
2938 | Inclusive Instructions Per Cycle: i+IPC | |
2939 | Exclusive Cycles Per Instruction: e+CPI | |
2940 | Inclusive Cycles Per Instruction: i+CPI | |
2941 | Size: size | |
2942 | PC Address: address | |
2943 | Name: name | |
2944 | @end verbatim | |
2945 | @end smallexample | |
2946 | ||
2947 | Among the other metrics, we see the new metrics for the IPC and CPI | |
2948 | listed. | |
2949 | ||
2950 | In the script below, we use this information and add the IPC and CPI | |
2951 | to the metrics to be displayed. We also use a the thread filter to | |
2952 | display these values for the individual threads. | |
2953 | ||
2954 | This is the complete script we have used. Other than a different selection | |
2955 | of the metrics, there are no new features. | |
2956 | ||
2957 | @cartouche | |
2958 | @smallexample | |
2959 | # Define the metrics | |
2960 | metrics e.insts:e.%cycles:e.IPC:e.CPI | |
2961 | # Sort with respect to cycles | |
2962 | sort e.cycles | |
2963 | # Limit the output to 5 lines | |
2964 | limit 5 | |
2965 | # Get the function overview for all threads | |
2966 | functions | |
2967 | # Get the function overview for thread 1 | |
2968 | thread_select 1 | |
2969 | functions | |
2970 | # Get the function overview for thread 2 | |
2971 | thread_select 2 | |
2972 | functions | |
2973 | # Get the function overview for thread 3 | |
2974 | thread_select 3 | |
2975 | functions | |
2976 | @end smallexample | |
2977 | @end cartouche | |
2978 | ||
2979 | In the metrics definition on the second line, we explicitly request the | |
2980 | counter values for the instructions (@code{e.insts}) and CPU cycles | |
2981 | (@code{e.cycles}) executed. These names can be found in output from the | |
2982 | @IndexSubentry{Options, @code{-metric_list}} | |
2983 | @IndexSubentry{Commands, @code{metric_list}} | |
2984 | @command{metric_list} command above. | |
2985 | In addition to these metrics, we also request the IPC and CPI to be shown. | |
2986 | ||
2987 | @IndexSubentry{Options, @code{-limit}} | |
2988 | @IndexSubentry{Commands, @code{limit}} | |
2989 | As before, we used the @command{limit} command to control the number of | |
2990 | functions displayed. We then request an overview for all the threads, | |
2991 | followed by three sets of two commands to select a thread and display the | |
2992 | function overview. | |
2993 | ||
2994 | The script above is used as follows: | |
2995 | ||
2996 | @cartouche | |
2997 | @smallexample | |
2998 | $ gprofng display text -script my-script-ipc mxv.hwc.def.2.thr.er | |
2999 | @end smallexample | |
3000 | @end cartouche | |
3001 | ||
3002 | @noindent | |
3003 | This script produces four tables. We list them separately below, | |
3004 | and have left out the additional output. | |
3005 | ||
3006 | @noindent | |
3007 | The first table shows the accumulated values across the three | |
3008 | threads that have been active. | |
3009 | ||
3010 | @smallexample | |
3011 | @verbatim | |
3012 | Functions sorted by metric: Exclusive CPU Cycles | |
3013 | ||
3014 | Excl. Excl. CPU Excl. Excl. Name | |
3015 | Instructions Cycles IPC CPI | |
3016 | Executed sec. % | |
3017 | 7906475309 2.691 100.00 1.473 0.679 <Total> | |
3018 | 7432724378 2.598 96.54 1.434 0.697 mxv_core | |
3019 | 188860269 0.035 1.31 2.682 0.373 erand48_r | |
3020 | 73623396 0.026 0.95 1.438 0.696 init_data | |
3021 | 76824434 0.018 0.66 2.182 0.458 drand48 | |
3022 | @end verbatim | |
3023 | @end smallexample | |
3024 | ||
3025 | @noindent | |
3026 | This shows that IPC of this program is completely dominated | |
3027 | by function @code{mxv_core}. It has a fairly low IPC value | |
3028 | of 1.43. | |
3029 | ||
3030 | @noindent | |
3031 | The next table is for thread 1 and shows the values for the | |
3032 | main thread. | |
3033 | ||
3034 | @smallexample | |
3035 | @verbatim | |
3036 | Exp Sel Total | |
3037 | === === ===== | |
3038 | 1 1 3 | |
3039 | Functions sorted by metric: Exclusive CPU Cycles | |
3040 | ||
3041 | Excl. Excl. CPU Excl. Excl. Name | |
3042 | Instructions Cycles IPC CPI | |
3043 | Executed sec. % | |
3044 | 473750931 0.093 100.00 2.552 0.392 <Total> | |
3045 | 188860269 0.035 37.93 2.682 0.373 erand48_r | |
3046 | 73623396 0.026 27.59 1.438 0.696 init_data | |
3047 | 76824434 0.018 18.97 2.182 0.458 drand48 | |
3048 | 134442832 0.013 13.79 5.250 0.190 __drand48_iterate | |
3049 | @end verbatim | |
3050 | @end smallexample | |
3051 | ||
3052 | @noindent | |
3053 | Although this thread hardly uses any CPU cycles, the overall IPC | |
3054 | of 2.55 is not all that bad. | |
3055 | ||
3056 | @noindent | |
3057 | Last, we show the tables for threads 2 and 3: | |
3058 | ||
3059 | @smallexample | |
3060 | @verbatim | |
3061 | Exp Sel Total | |
3062 | === === ===== | |
3063 | 1 2 3 | |
3064 | Functions sorted by metric: Exclusive CPU Cycles | |
3065 | ||
3066 | Excl. Excl. CPU Excl. Excl. Name | |
3067 | Instructions Cycles IPC CPI | |
3068 | Executed sec. % | |
3069 | 3716362189 1.298 100.00 1.435 0.697 <Total> | |
3070 | 3716362189 1.298 100.00 1.435 0.697 mxv_core | |
3071 | 0 0. 0. 0. 0. collector_root | |
3072 | 0 0. 0. 0. 0. driver_mxv | |
3073 | ||
3074 | Exp Sel Total | |
3075 | === === ===== | |
3076 | 1 3 3 | |
3077 | Functions sorted by metric: Exclusive CPU Cycles | |
3078 | ||
3079 | Excl. Excl. CPU Excl. Excl. Name | |
3080 | Instructions Cycles IPC CPI | |
3081 | Executed sec. % | |
3082 | 3716362189 1.300 100.00 1.433 0.698 <Total> | |
3083 | 3716362189 1.300 100.00 1.433 0.698 mxv_core | |
3084 | 0 0. 0. 0. 0. collector_root | |
3085 | 0 0. 0. 0. 0. driver_mxv | |
3086 | @end verbatim | |
3087 | @end smallexample | |
3088 | ||
3089 | It is seen that both execute the same number of instructions and | |
3090 | take about the same number of CPU cycles. As a result, the IPC is | |
3091 | the same for both threads. | |
3092 | ||
3093 | @c -- A new node -------------------------------------------------------------- | |
3094 | @c TBD @node Additional Features | |
3095 | @c TBD @section Additional Features | |
3096 | @c ---------------------------------------------------------------------------- | |
3097 | ||
3098 | @c -- A new node -------------------------------------------------------------- | |
3099 | @c TBD @node More Filtering Capabilities | |
3100 | @c TBD @subsection More Filtering Capabilities | |
3101 | @c ---------------------------------------------------------------------------- | |
3102 | ||
3103 | @c TBD Cover @code{samples} and @code{seconds} | |
3104 | ||
3105 | @c -- A new node -------------------------------------------------------------- | |
3106 | @node Java Profiling | |
3107 | @section Java Profiling | |
3108 | @c ---------------------------------------------------------------------------- | |
3109 | ||
3110 | @IndexSubentry{Options, @code{-j}} | |
3111 | @IndexSubentry{Java profiling, @code{-j on/off}} | |
3112 | The @CollectApp{} command supports Java profiling. The @code{-j on} option | |
3113 | can be used for this, but since this feature is enabled by default, there is | |
3114 | no need to set this explicitly. Java profiling may be disabled through the | |
3115 | @code{-j off} option. | |
3116 | ||
3117 | The program is compiled as usual and the experiment directory is created | |
3118 | similar to what we have seen before. The only difference with a C/C++ | |
3119 | application is that the program has to be explicitly executed by java. | |
3120 | ||
3121 | For example, this is how to generate the experiment data for a Java | |
3122 | program that has the source code stored in file @code{Pi.java}: | |
3123 | ||
3124 | @cartouche | |
3125 | @smallexample | |
3126 | $ javac Pi.java | |
3127 | $ gprofng collect app -j on -O pi.demo.er java Pi < pi.in | |
3128 | @end smallexample | |
3129 | @end cartouche | |
3130 | ||
3131 | Regarding which java is selected to generate the data, @ToolName{} | |
3132 | first looks for the JDK in the path set in either the | |
3133 | @IndexSubentry{Java profiling, @code{JDK_HOME}} | |
3134 | @code{JDK_HOME} environment variable, or in the | |
3135 | @IndexSubentry{Java profiling, @code{JAVA_PATH}} | |
3136 | @code{JAVA_PATH} environment variable. If neither of these variables is | |
3137 | set, it checks for a JDK in the search path (set in the PATH | |
3138 | environment variable). If there is no JDK in this path, it checks for | |
3139 | the java executable in @code{/usr/java/bin/java}. | |
3140 | ||
3141 | In case additional options need to be passed on to the JVM, the | |
3142 | @IndexSubentry{Options, @code{-J}} | |
3143 | @IndexSubentry{Java profiling, @code{-J <string>}} | |
3144 | @code{-J <string>} option can be used. The string with the | |
3145 | option(s) has to be delimited by quotation marks in case | |
3146 | there is more than one argument. | |
3147 | ||
3148 | The @DisplayText{} command may be used to view the performance data. There is | |
3149 | no need for any special options and the same commands as previously discussed | |
3150 | are supported. | |
3151 | ||
3152 | @IndexSubentry{Options, @code{-viewmode}} | |
3153 | @IndexSubentry{Commands, @code{viewmode}} | |
3154 | @IndexSubentry{Java profiling, different view modes} | |
3155 | The @code{viewmode} command | |
3156 | @xref{The Viewmode} | |
3157 | is very useful to examine the call stacks. | |
3158 | ||
3159 | For example, this is how one can see the native call stacks. For | |
3160 | lay-out purposes we have restricted the list to the first five entries: | |
3161 | ||
3162 | @cartouche | |
3163 | @smallexample | |
3164 | $ gprofng display text -limit 5 -viewmode machine -calltree pi.demo.er | |
3165 | @end smallexample | |
3166 | @end cartouche | |
3167 | ||
3168 | @smallexample | |
3169 | @verbatim | |
3170 | Print limit set to 5 | |
3171 | Viewmode set to machine | |
3172 | Functions Call Tree. Metric: Attributed Total CPU Time | |
3173 | ||
3174 | Attr. Name | |
3175 | Total | |
3176 | CPU sec. | |
3177 | 1.381 +-<Total> | |
3178 | 1.171 +-Pi.calculatePi(double) | |
3179 | 0.110 +-collector_root | |
3180 | 0.110 | +-JavaMain | |
3181 | 0.070 | +-jni_CallStaticVoidMethod | |
3182 | @end verbatim | |
3183 | @end smallexample | |
3184 | ||
3185 | @noindent | |
3186 | Note that the selection of the viewmode is echoed in the output. | |
3187 | ||
3188 | @c -- A new node -------------------------------------------------------------- | |
3189 | @node The gprofng Tools | |
3190 | @chapter The gprofng Tools | |
3191 | @c ---------------------------------------------------------------------------- | |
3192 | ||
3193 | Several tools are included in @ProductName{}. In subsequent chapters these | |
3194 | are discussed in detail. Below a brief description is given, followed by an | |
3195 | overview of the environment variables that are supported. | |
3196 | ||
3197 | @c -- A new node -------------------------------------------------------------- | |
3198 | @node Tools Overview | |
3199 | @section Tools Overview | |
3200 | @c ---------------------------------------------------------------------------- | |
3201 | ||
3202 | The following tools are supported by @ProductName{}: | |
3203 | ||
3204 | @table @code | |
3205 | ||
3206 | @item @CollectApp{} | |
3207 | @IndexSubentry{@code{gprofng}, @code{collect app}} | |
3208 | ||
3209 | Collects the performance data and stores the results in an experiment | |
3210 | directory. There are many options on this tool, but quite often the | |
3211 | defaults are sufficient. | |
3212 | An experiment directory is required for the subsequent analysis of | |
3213 | the results. | |
3214 | ||
3215 | @item @DisplayText{} | |
3216 | @IndexSubentry{@code{gprofng}, @code{display text}} | |
46c56757 | 3217 | |
7a515757 VM |
3218 | Generates performance reports in ASCII format. Commandline |
3219 | options, and/or commands in a script file are used to control the contents | |
3220 | and lay-out of the generated report(s). | |
3221 | ||
3222 | @item @DisplayHTML{} | |
3223 | @IndexSubentry{@code{gprofng}, @code{display html}} | |
46c56757 | 3224 | |
7a515757 VM |
3225 | Takes one or more experiment directories and generates a directory with |
3226 | HTML files. Starting from the index.html file, the performance data | |
3227 | may be examined in a browser. | |
3228 | ||
3229 | @item @DisplaySRC{} | |
3230 | @IndexSubentry{@code{gprofng}, @code{display src}} | |
46c56757 | 3231 | |
7a515757 VM |
3232 | Displays the source code, interleaved with the disassembled instructions. |
3233 | ||
3234 | @item @Archive{} | |
3235 | @IndexSubentry{@code{gprofng}, @code{archive}} | |
46c56757 | 3236 | |
7a515757 VM |
3237 | Archives an experiment directory by (optionally) including source code and |
3238 | object files, as well as the shared libraries that have been used. | |
3239 | ||
46c56757 VM |
3240 | @item @GUI{} |
3241 | @IndexSubentry{@code{gprofng}, @code{display gui}} | |
3242 | ||
3243 | This is an optional component that can be installed in addition to the | |
3244 | command line gprofng tools listed above. It supports the graphical | |
3245 | analysis of one or more experiments that have been created using | |
3246 | @CollectApp{}. | |
3247 | ||
3248 | The GUI part of @ProductName{} is a GNU project. This is the link to the | |
3249 | @url{https://savannah.gnu.org/projects/gprofng-gui, gprofng GUI page}. | |
3250 | This page contains more information (e.g. how to clone the repo). | |
3251 | There is also a | |
3252 | @url{https://ftp.gnu.org/gnu/gprofng-gui, tar file distribution directory} | |
3253 | with tar files that include everything that is needed to build and install | |
3254 | the GUI code. Various versions are available here. | |
3255 | Be aware that in order to build and use the gprofng GUI, Java needs to be | |
3256 | installed first. The minimum Java version required is Java 8. | |
3257 | ||
7a515757 VM |
3258 | @end table |
3259 | ||
3260 | @c -- A new section ----------------------------------------------------------- | |
3261 | @node The gprofng.rc file with default settings | |
3262 | @section The gprofng.rc file with default settings | |
3263 | @c ---------------------------------------------------------------------------- | |
3264 | The @file{gprofng.rc} | |
3265 | @cindex gprofng.rc | |
46c56757 VM |
3266 | file is used to define default settings for the @DisplayText{}, @Archive{}, |
3267 | and @DisplaySRC{} tools, but the user can override these defaults through | |
3268 | local configuration settings when building and installing from the source | |
3269 | code.. | |
7a515757 VM |
3270 | |
3271 | There are three files that are checked when the tool starts up. The first | |
3272 | file has pre-defined settings and comes with the installation, but through | |
46c56757 | 3273 | a hidden file called @file{.gprofng.rc}, the user can (re)define the defaults. |
7a515757 VM |
3274 | |
3275 | These are the locations and files that are checked upon starting the above | |
3276 | mentioned tools: | |
3277 | ||
3278 | @enumerate | |
3279 | ||
3280 | @item | |
3281 | The system-wide filename is called @file{gprofng.rc} and is located in | |
46c56757 | 3282 | the @file{/etc} subdirectory in case an RPM was used for the installation.. |
7a515757 VM |
3283 | |
3284 | If @ProductName{} has been built from the source, this file is in | |
3285 | subdirectory @file{etc} in the top level installation directory. | |
3286 | ||
3287 | @item | |
3288 | The user's home directory may have a hidden file called @file{.gprofng.rc}. | |
3289 | ||
3290 | @item | |
46c56757 | 3291 | The directory where @DisplayText{} (or @DisplaySRC{}) is invoked from, may |
7a515757 VM |
3292 | have a hidden file called @file{.gprofng.rc}. |
3293 | ||
3294 | @end enumerate | |
3295 | ||
3296 | The settings of each file override the settings of the file(s) read before it. | |
3297 | Defaults in the system-wide file are overruled by the file in the user home | |
3298 | directory (if any) and any settings in the @file{.gprofng.rc} file in the | |
3299 | current directory override those. | |
3300 | ||
3301 | Note that the settings in these files only affect the defaults. Unlike | |
3302 | the commands used in a script file, they are not commands for the tools. | |
3303 | ||
3304 | @c -- TBD indxobj_define, | |
3305 | @c -- TBD object_show, | |
3306 | @c -- TBD object_hide, | |
3307 | @c -- TBD object_api, | |
3308 | ||
3309 | The @file{.gprofng.rc} configuration files can contain the | |
3310 | @command{addpath}, | |
3311 | @command{compare}, | |
3312 | @command{dthresh}, | |
3313 | @command{name}, | |
3314 | @command{pathmap}, | |
3315 | @command{printmode}, | |
3316 | @command{sthresh}, | |
3317 | and | |
3318 | @command{viewmode} | |
3319 | commands as described in this user guide. | |
3320 | ||
3321 | They can also contain the following commands, @emph{which cannot be used on | |
3322 | either the command line, or in a script file}: | |
3323 | ||
3324 | @table @code | |
3325 | ||
3326 | @item dmetrics @var{metric-spec} | |
3327 | @IndexSubentry{Commands, @code{dmetrics}} | |
3328 | ||
3329 | Specify the default metrics to be displayed or printed in the function list. | |
3330 | The syntax and use of the metric list is described in section | |
3331 | @ref{Metric Definitions}. | |
3332 | The order of the metric keywords in the list determines the order in which | |
3333 | the metrics are presented. | |
3334 | ||
3335 | Default metrics for the @code{callers-callees} list are derived from the | |
3336 | function list default metrics by adding the corresponding attributed metric | |
3337 | before the first occurrence of each metric name in the list. | |
3338 | ||
3339 | @item dsort @var{metric-spec} | |
3340 | @IndexSubentry{Commands, @code{dsort}} | |
3341 | ||
3342 | Specify the default metric by which the function list is sorted. The sort | |
3343 | metric is the first metric in this list that matches a metric in any loaded | |
3344 | experiment, subject to the following conditions: | |
3345 | ||
3346 | @itemize @bullet | |
3347 | ||
3348 | @item | |
3349 | If the entry in @var{metric-spec} has a visibility string of an exclamation | |
3350 | point (@samp{!}), the first metric whose name matches is used, regardless of | |
3351 | whether it is visible. | |
3352 | ||
3353 | @item | |
3354 | If the entry in @var{metric-spec} has any other visibility string, the first | |
3355 | visible metric whose name matches is used. | |
3356 | ||
3357 | @end itemize | |
3358 | ||
3359 | The syntax and use of the metric list is described in section | |
3360 | @ref{Metric Definitions}. | |
3361 | The default sort metric for the @code{callers-callees} list is the attributed | |
3362 | metric corresponding to the default sort metric for the function list. | |
3363 | ||
3364 | @item en_desc @{on | off | =@var{regex}@} | |
3365 | @IndexSubentry{Commands, @code{en_desc}} | |
3366 | ||
3367 | Set the mode for reading descendant experiments to @samp{on} (enable all | |
3368 | descendants) or @samp{off} to disable all descendants. If | |
3369 | @samp{=}@var{regex} is used, enable data from those experiments whose | |
3370 | executable name matches the regular expression. | |
3371 | ||
3372 | The default setting is @samp{on} to follow all descendants. In reading | |
3373 | experiments with descendants, any sub-experiments that contain little or | |
3374 | no performance data are ignored by @DisplayText{}. | |
3375 | ||
3376 | @end table | |
3377 | ||
3378 | @c -- A new section ----------------------------------------------------------- | |
3379 | @node Filters | |
3380 | @section Filters | |
3381 | @c ---------------------------------------------------------------------------- | |
3382 | ||
3383 | Various filter commands are supported by @DisplayText{}. | |
3384 | Thanks to the use of filters, the user can zoom in on a certain area of | |
3385 | interest. With filters, it is possible to select one or more threads to | |
3386 | focus on, define a window in time, select specific call stacks, etc. | |
3387 | @IndexSubentry{Filters, Intro} | |
3388 | ||
3389 | While already powerful by themselves, filters may be combined to further | |
3390 | narrow down the view into the data. | |
3391 | ||
3392 | @IndexSubentry{Filters, Persistence} | |
3393 | It is important to note that filters are @emph{persistent}. A filter is | |
3394 | active until it is reset. This means that successive filter commands | |
3395 | increasingly narrow down the view until one or more are reset. | |
3396 | ||
3397 | @noindent | |
3398 | An example is the following: | |
3399 | ||
3400 | @cartouche | |
3401 | @smallexample | |
3402 | $ gprofng display text -thread_select 1 -functions \ | |
3403 | -cpu_select 2 -functions @dots{} | |
3404 | @end smallexample | |
3405 | @end cartouche | |
3406 | ||
3407 | This command selects thread 1 and requests the function view for this thread. | |
3408 | The third (@command{cpu_select 2}) command @emph{adds} the | |
3409 | constraint that only the events on CPU 2 are to be selected. This means | |
3410 | that the next function view selects events that were executed by thread 1 and | |
3411 | have been running on CPU 2. | |
3412 | ||
3413 | @noindent | |
3414 | In contrast with this single command line, the two commands below look similar, | |
3415 | but behave very differently: | |
3416 | ||
3417 | @cartouche | |
3418 | @smallexample | |
3419 | $ gprofng display text -thread_select 1 -functions @dots{} | |
3420 | $ gprofng display text -cpu_select 2 -functions @dots{} | |
3421 | @end smallexample | |
3422 | @end cartouche | |
3423 | ||
3424 | The first command displays the function view for thread 1. The second | |
3425 | command shows the function view for CPU 2 for @emph{all} threads that have | |
3426 | been running on this CPU. | |
3427 | ||
3428 | As the following example demonstrates, things get a little more tricky in | |
3429 | case a script file is used. Consider the following script file: | |
3430 | ||
3431 | @smallexample | |
3432 | @verbatim | |
3433 | thread_select 1 | |
3434 | functions | |
3435 | cpu_select 2 | |
3436 | functions | |
3437 | @end verbatim | |
3438 | @end smallexample | |
3439 | ||
3440 | This script file displays the function view for thread 1 first. This is | |
3441 | followed by those functions that were executed by thread 1 @emph{and} have | |
3442 | been run on CPU 2. | |
3443 | ||
3444 | If however, the script should behave like the two command line invocations | |
3445 | shown above, the thread selection filter needs to be reset before CPU 2 is | |
3446 | selected: | |
3447 | ||
3448 | @smallexample | |
3449 | @verbatim | |
3450 | thread_select 1 | |
3451 | functions | |
3452 | # Reset the thread selection filter: | |
3453 | thread_select all | |
3454 | cpu_select 2 | |
3455 | functions | |
3456 | @end verbatim | |
3457 | @end smallexample | |
3458 | ||
3459 | In general, filters behave differently than commands or options. In | |
3460 | particular there may be an interaction between different filter definitions. | |
3461 | ||
3462 | For example, as explained above, in the first script file the | |
3463 | @command{thread_select} and @command{cpu_select} commands interact. | |
3464 | ||
3465 | For a list of all the predefined filters see @ref{Predefined Filters}. | |
3466 | ||
3467 | @c -- A new section ----------------------------------------------------------- | |
3468 | @node Supported Environment Variables | |
3469 | @section Supported Environment Variables | |
3470 | @c ---------------------------------------------------------------------------- | |
3471 | ||
3472 | Various environment variables are supported. We refer to the man page for | |
3473 | gprofng(1) for an overview and description | |
46c56757 | 3474 | (@xref{Man page for @command{gprofng}}). |
7a515757 VM |
3475 | |
3476 | @c -- A new chapter ----------------------------------------------------------- | |
3477 | @node Performance Data Collection | |
3478 | @chapter Performance Data Collection | |
3479 | @c ---------------------------------------------------------------------------- | |
3480 | ||
3481 | The @CollectApp{} command is used to gather the application performance data | |
3482 | while the application executes. | |
3483 | At regular intervals, program execution is halted and the required data is | |
3484 | recorded. | |
3485 | @cindex Experiment directory | |
3486 | An experiment directory is created when the tool starts. This directory is | |
3487 | used to store the relevant information and forms the basis for a subsequent | |
3488 | analysis with one of the viewing tools. | |
3489 | ||
3490 | @c -- A new section ----------------------------------------------------------- | |
46c56757 VM |
3491 | @node The @command{gprofng collect app} Tool |
3492 | @section The @command{gprofng collect app} Tool | |
7a515757 VM |
3493 | @c ---------------------------------------------------------------------------- |
3494 | ||
3495 | This is the command to collect the performance information for the target | |
3496 | application. The usage is as follows: | |
3497 | ||
3498 | @cartouche | |
3499 | @smallexample | |
3500 | $ gprofng collect app [OPTION(S)] TARGET [TARGET_ARGUMENTS] | |
3501 | @end smallexample | |
3502 | @end cartouche | |
3503 | ||
3504 | @noindent | |
3505 | Options to the command are passed in first. This is followed by the name of | |
3506 | the target, which is typically a binary executable or a script, followed by | |
3507 | any options that may be required by the target. | |
3508 | ||
3509 | @c -- A new section ----------------------------------------------------------- | |
3510 | @node View the Performance Information | |
3511 | @chapter View the Performance Information | |
3512 | @c ---------------------------------------------------------------------------- | |
3513 | Various tools to view the performance data stored in one or more experiment | |
3514 | directories are available. In this chapter, these will all be covered in | |
3515 | detail. | |
3516 | ||
3517 | @c -- A new section ----------------------------------------------------------- | |
46c56757 VM |
3518 | @node The @command{gprofng display text} Tool |
3519 | @section The @command{gprofng display text} Tool | |
7a515757 VM |
3520 | @c ---------------------------------------------------------------------------- |
3521 | ||
3522 | This tool displays the performance information in ASCII format. It supports | |
3523 | a variety of views into the data recorded. | |
3524 | These views can be specified in two ways and both may be used simultaneously: | |
3525 | ||
3526 | @itemize @bullet | |
3527 | ||
3528 | @item | |
3529 | Command line options start with a dash (@samp{-}) symbol and may take an | |
3530 | argument. | |
3531 | ||
3532 | @item | |
3533 | Options may also be included in a file, the ``script file''. In this case, | |
3534 | the dash symbol should @emph{not} be included. Multiple script files can | |
3535 | be used on the same command line. | |
3536 | ||
3537 | @end itemize | |
3538 | ||
3539 | While they may appear as an option, they are really commands and this is | |
3540 | why they will be referred to as @emph{commands} | |
3541 | @cindex Commands | |
3542 | in the documentation. | |
3543 | ||
3544 | As a general rule, @emph{the order of options matters} and if the same option, | |
3545 | or command, occurs multiple times, the rightmost setting is selected. | |
3546 | ||
3547 | @c -- A new sub section ------------------------------------------------------- | |
3548 | @node The @code{gprofng display text} Commands | |
3549 | @subsection The @code{gprofng display text} Commands | |
3550 | @c ---------------------------------------------------------------------------- | |
3551 | ||
3552 | The most commonly used commands are documented in the man page for this tool | |
46c56757 VM |
3553 | (@xref{Man page for @command{gprofng display text}}). In this section we list |
3554 | and describe all other commands that are supported. | |
7a515757 VM |
3555 | |
3556 | @c -- A new sub subsection ---------------------------------------------------- | |
3557 | @node Commands that List Experiment Details | |
3558 | @unnumberedsubsubsec Commands that List Experiment Details | |
3559 | @c ---------------------------------------------------------------------------- | |
3560 | ||
3561 | @table @code | |
3562 | ||
3563 | @item experiment_ids | |
3564 | @ifclear man | |
3565 | @IndexSubentry{Options, @code{-experiment_ids}} | |
3566 | @IndexSubentry{Commands, @code{experiment_ids}} | |
3567 | @end ifclear | |
3568 | ||
3569 | For each experiment that has been loaded, show the totals of the metrics | |
3570 | recorded, plus some other operational characteristics like the name of | |
3571 | the executable, PID, etc. The top line contains the accumulated totals | |
3572 | for the metrics. | |
3573 | ||
3574 | @item experiment_list | |
3575 | @ifclear man | |
3576 | @IndexSubentry{Options, @code{-experiment_list}} | |
3577 | @IndexSubentry{Commands, @code{experiment_list}} | |
3578 | @end ifclear | |
3579 | ||
3580 | Display the list of experiments that are loaded. Each experiment is listed | |
3581 | with an index, which is used when selecting samples, threads, or LWPs, and | |
3582 | a process id (PID), which can be used for advanced filtering. | |
3583 | ||
3584 | @item cpu_list | |
3585 | @IndexSubentry{Options, @code{-cpu_list}} | |
3586 | @IndexSubentry{Commands, @code{cpu_list}} | |
3587 | ||
3588 | Display the total number of CPUs that have been used during the experiment(s). | |
3589 | ||
3590 | @item cpus | |
3591 | @IndexSubentry{Options, @code{-cpus}} | |
3592 | @IndexSubentry{Commands, @code{cpus}} | |
3593 | ||
3594 | Show a list of CPUs that were used by the application, along with the metrics | |
3595 | that have been recorded. The CPUs are represented by a CPU number and show the | |
3596 | Total CPU time by default. | |
3597 | ||
3598 | Note that since the data is sorted with respect to the default metric, it may | |
3599 | be useful to use the @command{sort name} command to show the list sorted with | |
3600 | respect to the CPU id. | |
3601 | ||
3602 | @item GCEvents | |
3603 | @IndexSubentry{Options, @code{-GCEvents}} | |
3604 | @IndexSubentry{Commands, @code{GCEvents}} | |
3605 | ||
3606 | This commands is for Java applications only. It shows any Garbage Collection | |
3607 | (GC) events that have occurred while the application was executing.. | |
3608 | ||
3609 | @item lwp_list | |
3610 | @IndexSubentry{Options, @code{-lwp_list}} | |
3611 | @IndexSubentry{Commands, @code{lwp_list}} | |
3612 | ||
3613 | Displays the list of LWPs processed during the experiment(s). | |
3614 | ||
3615 | @item processes | |
3616 | @IndexSubentry{Options, @code{-processes}} | |
3617 | @IndexSubentry{Commands, @code{processes}} | |
3618 | ||
3619 | For each experiment that has been loaded, this command displays a list of | |
3620 | processes that were created by the application, along with their metrics. | |
3621 | The processes are represented by process ID (PID) numbers and show the | |
3622 | Total CPU time metric by default. If additional metrics are recorded in | |
3623 | an experiment, these are shown as well. | |
3624 | ||
3625 | @item samples | |
3626 | @IndexSubentry{Options, @code{-samples}} | |
3627 | @IndexSubentry{Commands, @code{samples}} | |
3628 | ||
3629 | Display a list of sample points and their metrics, which reflect the | |
3630 | microstates recorded at each sample point in the loaded experiment. | |
3631 | The samples are represented by sample numbers and show the Total CPU time | |
3632 | by default. Other metrics might also be displayed if enabled. | |
3633 | ||
3634 | @item sample_list | |
3635 | @IndexSubentry{Options, @code{-sample_list}} | |
3636 | @IndexSubentry{Commands, @code{sample_list}} | |
3637 | ||
3638 | For each experiment loaded, display the list of samples currently selected. | |
3639 | ||
3640 | @item seconds | |
3641 | @IndexSubentry{Options, @code{-seconds}} | |
3642 | @IndexSubentry{Commands, @code{seconds}} | |
3643 | ||
3644 | Show each second of the profiling run that was captured in the experiment, | |
3645 | along with the metrics collected in that second. The seconds view differs | |
3646 | from the samples view in that it shows periodic samples that occur every | |
3647 | second beginning at 0 and the interval cannot be changed. | |
3648 | ||
3649 | The seconds view lists the seconds of execution with the Total CPU time by | |
3650 | default. Other metrics might also be displayed if the metrics are present | |
3651 | in the loaded experiments. | |
3652 | ||
3653 | @item threads | |
3654 | @IndexSubentry{Options, @code{-threads}} | |
3655 | @IndexSubentry{Commands, @code{threads}} | |
3656 | ||
3657 | Show a list of threads and their metrics. The threads are represented | |
3658 | by a process and thread pair and show the Total CPU time by default. | |
3659 | Other metrics might also be displayed by default if the metrics are | |
3660 | present in the loaded experiment. | |
3661 | ||
3662 | @item thread_list | |
3663 | @IndexSubentry{Options, @code{-thread_list}} | |
3664 | @IndexSubentry{Commands, @code{thread_list}} | |
3665 | ||
3666 | Display the list of threads currently selected for the analysis. | |
3667 | ||
3668 | @end table | |
3669 | ||
3670 | @noindent | |
3671 | @emph{The commands below are for use in scripts and interactive mode only. | |
3672 | They are not allowed on the command line.} | |
3673 | ||
3674 | @table @code | |
3675 | ||
3676 | @item add_exp @var{exp-name} | |
3677 | @IndexSubentry{Commands, @code{add_exp}} | |
3678 | ||
3679 | Add the named experiment to the current session. | |
3680 | ||
3681 | @item drop_exp @var{exp-name} | |
3682 | @IndexSubentry{Commands, @code{drop_exp}} | |
3683 | ||
3684 | Drop the named experiment from the current session. | |
3685 | ||
3686 | @item open_exp @var{exp-name} | |
3687 | @IndexSubentry{Commands, @code{open_exp}} | |
3688 | ||
3689 | Drop all loaded experiments from the session, and then load the named | |
3690 | experiment. | |
3691 | ||
3692 | @end table | |
3693 | ||
3694 | @c -- A new sub subsection ---------------------------------------------------- | |
3695 | @node Commands that Affect Listings and Output | |
3696 | @unnumberedsubsubsec Commands that Affect Listings and Output | |
3697 | @c ---------------------------------------------------------------------------- | |
3698 | ||
3699 | @table @code | |
3700 | ||
3701 | @item dthresh @var{value} | |
3702 | @IndexSubentry{Options, @code{-dthresh}} | |
3703 | @IndexSubentry{Commands, @code{dthresh}} | |
3704 | ||
3705 | Specify the threshold percentage for highlighting metrics in the annotated | |
3706 | disassembly code. If the value of any metric is equal to or greater than | |
3707 | @var{value} as a percentage of the maximum value of that metric for any | |
3708 | instruction line in the file, the line on which the metrics occur has a | |
3709 | @samp{##} marker inserted at the beginning of the line. The default is 75. | |
3710 | ||
3711 | @item printmode @{text | html | @var{single-char}@} | |
3712 | @IndexSubentry{Options, @code{-printmode}} | |
3713 | @IndexSubentry{Commands, @code{printmode}} | |
3714 | ||
3715 | Set the print mode. If the keyword is @code{text}, printing will be done in | |
3716 | tabular form using plain text. In case the @code{html} keyword is selected, | |
3717 | the output is formatted as an HTML table. | |
3718 | ||
3719 | Alternatively, @var{single-char} may be used in a delimiter separated list, | |
3720 | with the single character @var{single-char} as the delimiter. | |
3721 | ||
3722 | The printmode setting is used only for those commands that generate tables, | |
3723 | such as @command{functions}. The setting is ignored for other printing | |
3724 | commands, including those showing source and disassembly listings. | |
3725 | ||
3726 | @item sthresh @var{value} | |
3727 | @IndexSubentry{Options, @code{-sthresh}} | |
3728 | @IndexSubentry{Commands, @code{sthresh}} | |
3729 | ||
3730 | Specify the threshold percentage for highlighting metrics in the annotated | |
3731 | source code. If the value of any metric is equal to or greater than | |
3732 | @var{value} (as a percentage) of the maximum value of that metric for any | |
3733 | source line in the file, the line on which the metrics occur has a @samp{##} | |
3734 | marker inserted at the beginning of the line. The default is 75. | |
3735 | ||
3736 | @end table | |
3737 | ||
3738 | @c -- A new sub subsection ---------------------------------------------------- | |
3739 | @node Predefined Filters | |
3740 | @unnumberedsubsubsec Predefined Filters | |
3741 | @c ---------------------------------------------------------------------------- | |
3742 | ||
3743 | The filters below use a list, the selection list, to define a sequence of | |
3744 | numbers. @xref{The Selection List}. | |
3745 | Note that this selection is persistent, but the filter can be reset by using | |
3746 | @samp{all} as the @var{selection-list}. | |
3747 | ||
3748 | @table @code | |
3749 | ||
3750 | @item cpu_select @var{selection-list} | |
3751 | @IndexSubentry{Options, @code{-cpu_select}} | |
3752 | @IndexSubentry{Commands, @code{cpu_select}} | |
3753 | ||
3754 | Select the CPU ids specified in the @var{selection-list}. | |
3755 | ||
3756 | @item lwp_select @var{selection-list} | |
3757 | @IndexSubentry{Options, @code{-lwp_select}} | |
3758 | @IndexSubentry{Commands, @code{lwp_select}} | |
3759 | ||
3760 | Select the LWPs specified in the @var{selection-list}. | |
3761 | ||
3762 | @item sample_select @var{selection-list} | |
3763 | @IndexSubentry{Options, @code{-sample-select}} | |
3764 | @IndexSubentry{Commands, @code{sample-select}} | |
3765 | ||
3766 | @item thread_select @var{selection-list} | |
3767 | @IndexSubentry{Options, @code{-thread_select}} | |
3768 | @IndexSubentry{Commands, @code{thread_select}} | |
3769 | ||
3770 | Select a series of threads, or just one, to be used in subsequent views. | |
3771 | The @var{selection-list} consists of a sequence of comma separated numbers. | |
3772 | This may include a range of the form @samp{n-m}. | |
3773 | ||
3774 | @end table | |
3775 | ||
3776 | ||
3777 | @c -- A new sub subsection ---------------------------------------------------- | |
3778 | @node Commands to Set and Change Search Paths | |
3779 | @unnumberedsubsubsec Commands to Set and Change Search Paths | |
3780 | @c ---------------------------------------------------------------------------- | |
3781 | ||
3782 | @table @code | |
3783 | ||
3784 | @item addpath @var{path-list} | |
3785 | @IndexSubentry{Options, @code{-addpath}} | |
3786 | @IndexSubentry{Commands, @code{addpath}} | |
3787 | ||
3788 | Append @var{path-list} to the current setpath settings. Note that multiple | |
3789 | @command{addpath} commands can be used in @file{.gprofng.rc} files, and will | |
3790 | be concatenated. | |
3791 | ||
3792 | @item pathmap @var{old-prefix} @var{new-prefix} | |
3793 | @IndexSubentry{Options, @code{-pathmap}} | |
3794 | @IndexSubentry{Commands, @code{pathmap}} | |
3795 | ||
3796 | If a file cannot be found using the path list set by @command{addpath}, or | |
3797 | the @command{setpath} command, one or more path remappings may be set with the | |
3798 | @command{pathmap} command. | |
3799 | ||
3800 | With path mapping, the user can specify how to replace the leading component | |
3801 | in a full path by a different string. | |
3802 | ||
3803 | With this command, any path name for a source file, object file, or shared | |
3804 | object that begins with the prefix specified with @var{old-prefix}, the | |
3805 | old prefix is replaced by the prefix specified with @var{new-prefix}. | |
3806 | The resulting path is used to find the file. | |
3807 | ||
3808 | For example, if a source file located in directory @file{/tmp} | |
3809 | is shown in the @DisplayText{} output, but should instead be taken from | |
3810 | @file{/home/demo}, the following @file{pathmap} command redefines the | |
3811 | path: | |
3812 | ||
3813 | @smallexample | |
3814 | $ gprofng diplay text -pathmap /tmp /home/demo -source ... | |
3815 | @end smallexample | |
3816 | ||
3817 | Note that multiple @command{pathmap} commands can be supplied, and each is | |
3818 | tried until the file is found. | |
3819 | ||
3820 | @item setpath @var{path-list} | |
3821 | @IndexSubentry{Options, @code{-setpath}} | |
3822 | @IndexSubentry{Commands, @code{setpath}} | |
3823 | ||
3824 | Set the path used to find source and object files. The path is defined | |
3825 | through the @var{path-list} keyword. It is a colon separated list of | |
3826 | directories, jar files, or zip files. | |
3827 | If any directory has a colon character in it, escape it with a | |
3828 | backslash (@samp{\}). | |
3829 | ||
3830 | The special directory name @code{$expts}, refers | |
3831 | to the set of current experiments in the order in which they were loaded. | |
3832 | You can abbreviate it with a single @samp{$} character. | |
3833 | ||
3834 | The default path is @samp{$expts:..} which is the directories of the | |
3835 | loaded experiments and the current working directory. | |
3836 | ||
3837 | Use @command{setpath} with no argument to display the current path. | |
3838 | ||
3839 | Note that @command{setpath} commands @emph{are not allowed .gprofng.rc | |
3840 | configuration files}. | |
3841 | ||
3842 | @end table | |
3843 | ||
3844 | @c -- A new subsection -------------------------------------------------------- | |
3845 | @c -- TBD @node Usage examples for @code{gprofng display text} | |
3846 | @c -- TBD @subsection Usage examples for @code{gprofng display text} | |
3847 | @c ---------------------------------------------------------------------------- | |
3848 | ||
3849 | @c -- TBD In this section we present usage examples. | |
3850 | ||
3851 | @c -- A new chapter ----------------------------------------------------------- | |
3852 | @c TBD @node The @code{gprofng display html} Tool | |
3853 | @c TBD @section The @code{gprofng display html} Tool | |
3854 | @c ---------------------------------------------------------------------------- | |
3855 | @c TBD The options are documented in the man page for this tool. In this section we | |
3856 | @c TBD present usage examples. | |
3857 | ||
3858 | @c -- A new chapter ----------------------------------------------------------- | |
3859 | @c TBD @node Display Source Code | |
3860 | @c TBD @chapter Display Source Code | |
3861 | @c ---------------------------------------------------------------------------- | |
3862 | @c TBD The options are documented in the man page for this tool. In this section we | |
3863 | @c TBD present usage examples. | |
3864 | ||
3865 | ||
3866 | @c -- A new chapter ----------------------------------------------------------- | |
3867 | @c TBD @node Archive Experiment Data | |
3868 | @c TBD @chapter Archive Experiment Data | |
3869 | @c ---------------------------------------------------------------------------- | |
3870 | @c TBD The options are documented in the man page for this tool. In this section we | |
3871 | @c TBD present usage examples. | |
3872 | ||
3873 | @c -- A new chapter ----------------------------------------------------------- | |
3874 | @node Terminology | |
3875 | @chapter Terminology | |
3876 | @c ---------------------------------------------------------------------------- | |
3877 | ||
3878 | Throughout this manual, certain terminology specific to profiling tools, | |
3879 | or @ToolName{}, or even to this document only, is used. In this chapter | |
3880 | this terminology is explained in detail. | |
3881 | ||
3882 | @menu | |
3883 | * The Program Counter:: What is a Program Counter? | |
3884 | * Inclusive and Exclusive Metrics:: An explanation of inclusive and exclusive metrics. | |
3885 | * Metric Definitions:: Definitions associated with metrics. | |
3886 | * The Viewmode:: Select the way call stacks are presented. | |
3887 | * The Selection List:: How to define a selection. | |
3888 | * Load Objects and Functions:: The components in an application. | |
3889 | * The Concept of a CPU in gprofng:: The definition of a CPU. | |
3890 | * Hardware Event Counters Explained:: What are event counters? | |
3891 | * apath:: Our generic definition of a path. | |
3892 | @end menu | |
3893 | ||
3894 | @c ---------------------------------------------------------------------------- | |
3895 | @node The Program Counter | |
3896 | @section The Program Counter | |
3897 | @c ---------------------------------------------------------------------------- | |
3898 | ||
3899 | @cindex PC | |
3900 | @cindex Program Counter | |
3901 | The @emph{Program Counter}, or PC for short, keeps track where program execution is. | |
3902 | The address of the next instruction to be executed is stored in a special | |
3903 | purpose register in the processor, or core. | |
3904 | ||
3905 | @cindex Instruction pointer | |
3906 | The PC is sometimes also referred to as the @emph{instruction pointer}, but | |
3907 | we will use Program Counter or PC throughout this document. | |
3908 | ||
3909 | @c ---------------------------------------------------------------------------- | |
3910 | @node Inclusive and Exclusive Metrics | |
3911 | @section Inclusive and Exclusive Metrics | |
3912 | @c ---------------------------------------------------------------------------- | |
3913 | ||
3914 | In the remainder, these two concepts occur quite often and for lack of a better | |
3915 | place, they are explained here. | |
3916 | ||
3917 | @cindex Inclusive metric | |
3918 | The @emph{inclusive} value for a metric includes all values that are part of | |
3919 | the dynamic extent of the target function. For example if function @code{A} | |
3920 | calls functions @code{B} and @code{C}, the inclusive CPU time for @code{A} | |
3921 | includes the CPU time spent in @code{B} and @code{C}. | |
3922 | ||
3923 | @cindex Exclusive metric | |
3924 | In contrast with this, the @emph{exclusive} value for a metric is computed | |
3925 | by excluding the metric values used by other functions called. In our imaginary | |
3926 | example, the exclusive CPU time for function @code{A} is the time spent outside | |
3927 | calling functions @code{B} and @code{C}. | |
3928 | ||
3929 | @cindex Leaf function | |
3930 | In case of a @emph{leaf function}, the inclusive and exclusive values for the | |
3931 | metric are the same since by definition, it is not calling any other | |
3932 | function(s). | |
3933 | ||
3934 | Why do we use these two different values? The inclusive metric shows the most | |
3935 | expensive path, in terms of this metric, in the application. For example, if | |
3936 | the metric is cache misses, the function with the highest inclusive metric | |
3937 | tells you where most of the cache misses come from. | |
3938 | ||
3939 | Within this branch of the application, the exclusive metric points to the | |
3940 | functions that contribute and help to identify which part(s) to consider | |
3941 | for further analysis. | |
3942 | ||
3943 | @c ---------------------------------------------------------------------------- | |
3944 | @node Metric Definitions | |
3945 | @section Metric Definitions | |
3946 | @c ---------------------------------------------------------------------------- | |
3947 | The metrics displayed in the various views are highly customizable. In this | |
3948 | section it is explained how to construct the metrics definition(s). | |
3949 | ||
3950 | @IndexSubentry{Options, @code{-metrics}} | |
3951 | @IndexSubentry{Commands, @code{metrics}} | |
3952 | The @command{metrics} command takes a colon (@samp{:}) separated list, where | |
3953 | each item in the list consists of the following three fields: | |
3954 | @var{<flavor>}@var{<visibility>}@var{<metric-name>}. | |
3955 | ||
3956 | @cindex Flavor field | |
3957 | @cindex Visibility field | |
3958 | @cindex Metric name field | |
3959 | @IndexSubentry{Metrics, Flavor field} | |
3960 | @IndexSubentry{Metrics, Visibility field} | |
3961 | @IndexSubentry{Metrics, Metric name field} | |
3962 | The @var{<flavor>} field is either @samp{e} for ``exclusive'', and/or | |
3963 | @samp{i} for ``inclusive''. The @var{<metric-name>} field is the name of | |
3964 | the metric and the @var{<visibility>} field consists of one ore more characters | |
3965 | from the following table: | |
3966 | ||
3967 | @table @code | |
3968 | ||
3969 | @item . | |
3970 | Show the metric as time. This applies to timing metrics and hardware event | |
3971 | counters that measure cycles. Interpret as @samp{+} for other metrics. | |
3972 | ||
3973 | @item % | |
3974 | Show the metric as a percentage of the total value for this metric. | |
3975 | ||
3976 | @item + | |
3977 | Show the metric as an absolute value. For hardware event counters this is | |
3978 | the event count. Interpret as @samp{.} for timing metrics. | |
3979 | ||
3980 | @item ! | |
3981 | Do not show any metric value. Cannot be used with other visibility characters. | |
3982 | This visibility is meant to be used in a @command{dmetrics} command to set | |
3983 | default metrics that override the built-in visibility defaults | |
3984 | for each type of metric. | |
3985 | ||
3986 | @end table | |
3987 | ||
3988 | Both the @var{<flavor>} and @var{<visibility>} strings may have more than one | |
3989 | character. If both strings have more than one character, the @var{<flavor>} | |
3990 | string is expanded first. For example, @code{ie.%user} is first expanded to | |
3991 | @code{i.%user:e.%user}, which is then expanded into | |
3992 | @code{i.user:i%user:e.user:e%user}. | |
3993 | ||
3994 | @c ---------------------------------------------------------------------------- | |
3995 | @node The Viewmode | |
3996 | @section The Viewmode | |
3997 | ||
3998 | @cindex Viewmode | |
3999 | @IndexSubentry{Options, @code{-viewmode}} | |
4000 | @IndexSubentry{Commands, @code{viewmode}} | |
4001 | ||
4002 | There are different ways to view a call stack in Java. In @ToolName{}, this | |
4003 | is called the @emph{viewmode} and the setting is controlled through a command | |
4004 | with the same name. | |
4005 | ||
4006 | The @code{viewmode} command takes one of the following keywords: | |
4007 | ||
4008 | @table @code | |
4009 | ||
4010 | @item user | |
4011 | This is the default and shows the Java call stacks for Java threads. | |
4012 | No call stacks for any housekeeping threads are shown. The function | |
4013 | list contains a function | |
4014 | @IndexSubentry{Java profiling, @code{<JVM-System>}} | |
4015 | @code{<JVM-System>} that represents the aggregated time from non-Java | |
4016 | threads. | |
4017 | When the JVM software does not report a Java call stack, time is reported | |
4018 | against the function | |
4019 | @IndexSubentry{Java profiling, @code{<no Java callstack recorded>}} | |
4020 | @code{<no Java callstack recorded>}. | |
4021 | ||
4022 | @item expert | |
4023 | Show the Java call stacks for Java threads when the Java code from the | |
4024 | user is executed and machine call stacks when JVM code is executed, or | |
4025 | when the JVM software does not report a Java call stack. | |
4026 | Show the machine call stacks for housekeeping threads. | |
4027 | ||
4028 | @item machine | |
4029 | Show the actual native call stacks for all threads. | |
4030 | ||
4031 | @end table | |
4032 | ||
4033 | @c ---------------------------------------------------------------------------- | |
4034 | @node The Selection List | |
4035 | @section The Selection List | |
4036 | @c ---------------------------------------------------------------------------- | |
4037 | ||
4038 | @cindex Selection list | |
4039 | @cindex List specification | |
4040 | Several commands allow the user to specify a sequence of numbers called the | |
4041 | @emph{selection list}. Such a list may for example be used to select specific | |
4042 | threads from all the threads that have been used when conducting the | |
4043 | experiment(s). | |
4044 | ||
4045 | A selection list (or ``list'' in the remainder of this section) can be a | |
4046 | single number, a contiguous range of numbers with the start and end numbers | |
4047 | separated by a hyphen (@samp{-}), a comma-separated list of numbers and | |
4048 | ranges, or the @code{all} keyword that resets the filter. | |
4049 | @IndexSubentry{Filters, Reset to default} | |
4050 | Lists must not contain spaces. | |
4051 | ||
4052 | Each list can optionally be preceded by an experiment list with a similar | |
4053 | format, separated from the list by a colon (:). | |
4054 | If no experiment list is included, the list applies to all experiments. | |
4055 | ||
4056 | Multiple lists can be concatenated by separating the individual lists | |
4057 | by a plus sign. | |
4058 | ||
4059 | These are some examples of various filters using a list: | |
4060 | ||
4061 | @table @code | |
4062 | ||
4063 | @item thread_select 1 | |
4064 | Select thread 1 from all experiments. | |
4065 | ||
4066 | @item thread_select all:1 | |
4067 | Select thread 1 from all experiments. | |
4068 | ||
4069 | @item thread_select 1:all | |
4070 | Select all the threads from the first experiment loaded. | |
4071 | ||
4072 | @item thread_select 1:2+3:4 | |
4073 | Select thread 2 from experiment 1 and thread 4 from experiment 3. | |
4074 | ||
4075 | @item cpu_select all:1,3,5 | |
4076 | Selects cores 1, 3, and 5 from all experiments. | |
4077 | ||
4078 | @item cpu_select 1,2:all | |
4079 | Select all cores from experiments 1 and 2. | |
4080 | ||
4081 | @end table | |
4082 | ||
4083 | Recall that there are several list commands that show the mapping between the | |
4084 | numbers and the targets. | |
4085 | ||
4086 | @IndexSubentry{Options, @code{-experiment_list}} | |
4087 | @IndexSubentry{Commands, @code{experiment_list}} | |
4088 | For example, the @command{experiment_list} command shows the name(s) of the | |
4089 | experiment(s) loaded and the associated number. In this example it is used | |
4090 | to get this information for a range of experiments: | |
4091 | ||
4092 | @cartouche | |
4093 | @smallexample | |
4094 | $ gprofng display text -experiment_list mxv.?.thr.er | |
4095 | @end smallexample | |
4096 | @end cartouche | |
4097 | ||
4098 | @noindent | |
4099 | This is the output, showing for each experiment the ID, the PID, and the name: | |
4100 | ||
4101 | @smallexample | |
4102 | @verbatim | |
4103 | ID Sel PID Experiment | |
4104 | == === ======= ============ | |
4105 | 1 yes 2750071 mxv.1.thr.er | |
4106 | 2 yes 1339450 mxv.2.thr.er | |
4107 | 3 yes 3579561 mxv.4.thr.er | |
4108 | @end verbatim | |
4109 | @end smallexample | |
4110 | ||
4111 | @c ---------------------------------------------------------------------------- | |
4112 | @node Load Objects and Functions | |
4113 | @section Load Objects and Functions | |
4114 | @c ---------------------------------------------------------------------------- | |
4115 | ||
4116 | An application consists of various components. The source code files are | |
4117 | compiled into object files. These are then glued together at link time to form | |
4118 | the executable. | |
4119 | During execution, the program may also dynamically load objects. | |
4120 | ||
4121 | @cindex Load object | |
4122 | A @emph{load object} is defined to be an executable, or shared object. A shared | |
4123 | library is an example of a load object in @ToolName{}. | |
4124 | ||
4125 | Each load object, contains a text section with the instructions generated by the | |
4126 | compiler, a data section for data, and various symbol tables. | |
4127 | All load objects must contain an | |
4128 | @cindex ELF | |
4129 | ELF | |
4130 | symbol table, which gives the names and addresses of all the globally known | |
4131 | functions in that object. | |
4132 | ||
4133 | Load objects compiled with the -g option contain additional symbolic information | |
4134 | that can augment the ELF symbol table and provide information about functions that | |
4135 | are not global, additional information about object modules from which the functions | |
4136 | came, and line number information relating addresses to source lines. | |
4137 | ||
4138 | The term | |
4139 | @cindex Function | |
4140 | @emph{function} | |
4141 | is used to describe a set of instructions that represent a high-level operation | |
4142 | described in the source code. The term also covers methods as used in C++ and in | |
4143 | the Java programming language. | |
4144 | ||
4145 | In the @ToolName{} context, functions are provided in source code format. | |
4146 | Normally their names appear in the symbol table representing a set of addresses. | |
4147 | @cindex Program Counter | |
4148 | @cindex PC | |
4149 | If the Program Counter (PC) is within that set, the program is executing within that function. | |
4150 | ||
4151 | In principle, any address within the text segment of a load object can be mapped to a | |
4152 | function. Exactly the same mapping is used for the leaf PC and all the other PCs on the | |
4153 | call stack. | |
4154 | ||
4155 | Most of the functions correspond directly to the source model of the program, but | |
4156 | there are exceptions. This topic is however outside of the scope of this guide. | |
4157 | ||
4158 | @c ---------------------------------------------------------------------------- | |
46c56757 VM |
4159 | @node The Concept of a CPU in gprofng |
4160 | @section The Concept of a CPU in gprofng | |
7a515757 VM |
4161 | @c ---------------------------------------------------------------------------- |
4162 | ||
4163 | @cindex CPU | |
4164 | In @ProductName{}, there is the concept of a CPU. Admittedly, this is not the | |
4165 | best word to describe what is meant here and may be replaced in the future. | |
4166 | ||
4167 | The word CPU is used in many of the displays. | |
4168 | In the context of @ProductName{}, it is meant to denote a part of the | |
4169 | processor that is capable of executing instructions and with its own state, | |
4170 | like the program counter. | |
4171 | ||
4172 | For example, on a contemporary processor, a CPU could be a core. In case | |
4173 | hardware threads are supported within a core, a CPU is one of those | |
4174 | hardware threads. | |
4175 | ||
4176 | To see which CPUs have been used in the experiment, use the @command{cpu} | |
4177 | command in @DisplayText{}. | |
4178 | ||
4179 | @c ---------------------------------------------------------------------------- | |
4180 | @node Hardware Event Counters Explained | |
4181 | @section Hardware Event Counters Explained | |
4182 | @c ---------------------------------------------------------------------------- | |
4183 | ||
4184 | @IndexSubentry{Hardware event counters, description} | |
4185 | For quite a number of years now, many microprocessors have supported hardware | |
4186 | event counters. | |
4187 | ||
4188 | On the hardware side, this means that in the processor there are one or more | |
4189 | registers dedicated to count certain activities, or ``events''. | |
4190 | Examples of such events are the number of instructions executed, or the number | |
4191 | of cache misses at level 2 in the memory hierarchy. | |
4192 | ||
4193 | While there is a limited set of such registers, the user can map events onto | |
4194 | them. In case more than one register is available, this allows for the | |
4195 | simultaenous measurement of various events. | |
4196 | ||
4197 | A simple, yet powerful, example is to simultaneously count the number of CPU | |
4198 | cycles and the number of instructions excuted. These two numbers can then be | |
4199 | used to compute the | |
4200 | @cindex IPC | |
4201 | @emph{IPC} value. IPC stands for ``Instructions Per Clockcycle'' and each processor | |
4202 | has a maximum. For example, if this maximum number is 2, it means the | |
4203 | processor is capable of executing two instructions every clock cycle. | |
4204 | ||
4205 | Whether this is actually achieved, depends on several factors, including the | |
4206 | instruction characteristics. | |
4207 | However, in case the IPC value is well below this maximum in a time critical | |
4208 | part of the application and this cannot be easily explained, further | |
4209 | investigation is probably warranted. | |
4210 | ||
4211 | @cindex CPI | |
4212 | A related metric is called @emph{CPI}, or ``Clockcycles Per Instruction''. | |
4213 | It is the inverse of the CPI and can be compared against the theoretical | |
4214 | value(s) of the target instruction(s). A significant difference may point | |
4215 | at a bottleneck. | |
4216 | ||
4217 | One thing to keep in mind is that the value returned by a counter can either | |
4218 | be the number of times the event occured, or a CPU cycle count. In case of | |
4219 | the latter it is possible to convert this number to time. | |
4220 | ||
4221 | @IndexSubentry{Hardware event counters, variable CPU frequency} | |
4222 | This is often easier to interpret than a simple count, but there is one | |
4223 | caveat to keep in mind. The CPU frequency may not have been constant while | |
4224 | the experimen was recorded and this impacts the time reported. | |
4225 | ||
4226 | These event counters, or ``counters'' for short, provide great insight into | |
4227 | what happens deep inside the processor. In case higher level information does | |
4228 | not provide the insight needed, the counters provide the information to get | |
4229 | to the bottom of a performance problem. | |
4230 | ||
4231 | There are some things to consider though. | |
4232 | ||
4233 | @itemize @bullet | |
4234 | ||
4235 | @item | |
4236 | The event definitions and names vary across processors and it may even happen | |
4237 | that some events change with an update. | |
4238 | Unfortunately and this is luckily rare, there are sometimes bugs causing the | |
4239 | wrong count to be returned. | |
4240 | ||
4241 | @IndexSubentry{Hardware event counters, alias name} | |
4242 | In @ToolName{}, some of the processor specific event names have an alias | |
4243 | name. For example @code{insts} measures the instructions executed. | |
4244 | These aliases not only makes it easier to identify the functionality, but also | |
4245 | provide portability of certain events across processors. | |
4246 | ||
4247 | @item | |
4248 | Another complexity is that there are typically many events one can monitor. | |
4249 | There may up to hundreds of events available and it could require several | |
4250 | experiments to zoom in on the root cause of a performance problem. | |
4251 | ||
4252 | @item | |
4253 | There may be restrictions regarding the mapping of event(s) onto the | |
4254 | counters. For example, certain events may be restricted to specific | |
4255 | counters only. As a result, one may have to conduct additional experiments | |
4256 | to cover all the events of interest. | |
4257 | ||
4258 | @item | |
4259 | The names of the events may also not be easy to interpret. In such cases, | |
4260 | the description can be found in the architecture manual for the processor. | |
4261 | ||
4262 | @end itemize | |
4263 | ||
4264 | Despite these drawbacks, hardware event counters are extremely useful and | |
4265 | may even turn out to be indispensable. | |
4266 | ||
4267 | @c ---------------------------------------------------------------------------- | |
4268 | @node apath | |
4269 | @section What is <apath>? | |
4270 | @c ---------------------------------------------------------------------------- | |
4271 | ||
4272 | In most cases, @ToolName{} shows the absolute pathnames of directories. These | |
4273 | tend to be rather long, causing display issues in this document. | |
4274 | ||
4275 | Instead of wrapping these long pathnames over multiple lines, we decided to | |
4276 | represent them by the @code{<apath>} symbol, which stands for ``an absolute | |
4277 | pathname''. | |
4278 | ||
4279 | Note that different occurrences of @code{<apath>} may represent different | |
4280 | absolute pathnames. | |
4281 | ||
4282 | @c -- A new node -------------------------------------------------------------- | |
4283 | @node Other Document Formats | |
4284 | @chapter Other Document Formats | |
4285 | @c ---------------------------------------------------------------------------- | |
4286 | ||
4287 | @emph{This chapter is applicable when building gprofng from the | |
4288 | binutils source.} | |
4289 | ||
4290 | This document is written in Texinfo and the source text is made available as | |
4291 | part of the binutils distribution. The file name is @code{gprofng.texi} and | |
4292 | can be found in subdirectory @code{gprofng/doc} of the top level binutils | |
4293 | directory. | |
4294 | ||
4295 | The default installation procedure creates a file in the @code{info} format and | |
4296 | stores it in the documentation section of binutils. | |
4297 | This source file can however also be used to generate the document in the | |
4298 | @code{html} and @code{pdf} formats. These may be easier to read and search. | |
4299 | ||
4300 | To generate this documentation file in a different format, go to the directory | |
4301 | that was used to build the tools. The make file to build the other formats is | |
4302 | in the @code{gprofng/doc} subdirectory. | |
4303 | ||
4304 | For example, if you have set the build directory to be @var{<my-build-dir>}, | |
4305 | go to subdirectory @var{<my-build-dir>/gprofng/doc}. | |
4306 | ||
4307 | This subdirectory has a single filed called @file{Makefile} that can be used to | |
4308 | build the documentation in various formats. We recommend to use these commands. | |
4309 | ||
4310 | There are four commands to generate the documentation in the @code{html} or | |
4311 | @code{pdf} format. It is assumed that you are in directory @code{gprofng/doc} | |
4312 | under the main directory @var{<my-build-dir>}. | |
4313 | ||
4314 | @table @code | |
4315 | ||
4316 | @item make html | |
4317 | Create the html file in the current directory. | |
4318 | ||
4319 | @item make pdf | |
4320 | Create the pdf file in the current directory. | |
4321 | ||
4322 | @item make install-html | |
4323 | Create and install the html file in the binutils documentation directory. | |
4324 | ||
4325 | @item make install-pdf | |
4326 | Creat and install the pdf file in the binutils documentation directory. | |
4327 | ||
4328 | @end table | |
4329 | ||
4330 | For example, to install this document in the binutils documentation directory, the | |
4331 | commands below may be executed. In this notation, @var{<format>} | |
4332 | is one of @code{html}, or @code{pdf}: | |
4333 | ||
4334 | @smallexample | |
4335 | @verbatim | |
4336 | $ cd <my-build-dir>/gprofng/doc | |
4337 | $ make install-<format> | |
4338 | @end verbatim | |
4339 | @end smallexample | |
4340 | ||
4341 | The binutils installation directory is either the default @code{/usr/local} or the one | |
4342 | that has been set with the @code{--prefix} option as part of the @code{configure} | |
4343 | command. In this example we symbolize this location with @code{<install>}. | |
4344 | ||
4345 | The documentation directory is @code{<install>/share/doc/gprofng} in case | |
4346 | @code{html} or @code{pdf} is selected and @code{<install>/share/info} for the | |
4347 | file in the @code{info} format. | |
4348 | ||
4349 | @noindent | |
4350 | Some things to note: | |
4351 | ||
4352 | @itemize | |
4353 | ||
4354 | @item | |
4355 | For the @code{pdf} file to be generated, the @code{texi2dvi} tool is required. | |
4356 | @cindex texi2dvi | |
4357 | It is for example available as part of the @code{texinfo-tex} package. | |
4358 | ||
4359 | @item | |
4360 | Instead of generating a single file in the @code{html} format, it is also | |
4361 | possible to create a directory with individual files for the various chapters. | |
4362 | To do so, remove the use of @code{--no-split} in variable @code{MAKEINFOHTML} | |
4363 | in the make file in the @code{<my-build-dir/gprofng/doc} directory. | |
4364 | ||
4365 | @end itemize | |
4366 | ||
4367 | @c -- An appendix ------------------------------------------------------------- | |
46c56757 VM |
4368 | @node The gprofng Man Pages |
4369 | @appendix The gprofng Man Pages | |
7a515757 VM |
4370 | @c ---------------------------------------------------------------------------- |
4371 | ||
4372 | In this appendix the man pages for the various @ProductName{} tools are listed. | |
4373 | ||
4374 | @c -- A new node -------------------------------------------------------------- | |
4375 | @c @node gprofng driver | |
46c56757 | 4376 | @node Man page for @command{gprofng} |
7a515757 VM |
4377 | @section Man page for @command{gprofng} |
4378 | @c ---------------------------------------------------------------------------- | |
4379 | ||
4380 | @include gprofng.texi | |
4381 | ||
4382 | @c -- A new node -------------------------------------------------------------- | |
4383 | @page | |
46c56757 | 4384 | @node Man page for @command{gprofng collect app} |
7a515757 VM |
4385 | @section Man page for @command{gprofng collect app} |
4386 | @c ---------------------------------------------------------------------------- | |
4387 | ||
4388 | @include gp-collect-app.texi | |
4389 | ||
4390 | @c -- A new node -------------------------------------------------------------- | |
4391 | @page | |
46c56757 | 4392 | @node Man page for @command{gprofng display text} |
7a515757 VM |
4393 | @section Man page for @command{gprofng display text} |
4394 | @c ---------------------------------------------------------------------------- | |
4395 | ||
4396 | @include gp-display-text.texi | |
4397 | ||
4398 | @c -- A new node -------------------------------------------------------------- | |
4399 | @page | |
46c56757 | 4400 | @node Man page for @command{gprofng display html} |
7a515757 VM |
4401 | @section Man page for @command{gprofng display html} |
4402 | @c ---------------------------------------------------------------------------- | |
4403 | ||
4404 | @include gp-display-html.texi | |
4405 | ||
4406 | @c -- A new node -------------------------------------------------------------- | |
4407 | @page | |
46c56757 | 4408 | @node Man page for @command{gprofng display src} |
7a515757 VM |
4409 | @section Man page for @command{gprofng display src} |
4410 | @c ---------------------------------------------------------------------------- | |
4411 | ||
4412 | @include gp-display-src.texi | |
4413 | ||
4414 | @c -- A new node -------------------------------------------------------------- | |
4415 | @page | |
46c56757 | 4416 | @node Man page for @command{gprofng archive} |
7a515757 VM |
4417 | @section Man page for @command{gprofng archive} |
4418 | @c ---------------------------------------------------------------------------- | |
4419 | ||
4420 | @include gp-archive.texi | |
4421 | ||
4422 | @ifnothtml | |
4423 | @node Index | |
4424 | @unnumbered Index | |
4425 | @printindex cp | |
4426 | @end ifnothtml | |
4427 | ||
4428 | @bye |