5. Callgrind: a heavyweight profiler

Table of Contents

5.1. Overview
5.2. Purpose
5.2.1. Profiling as part of Application Development
5.2.2. Profiling Tools
5.3. Usage
5.3.1. Basics
5.3.2. Multiple profiling dumps from one program run
5.3.3. Limiting the range of collected events
5.3.4. Avoiding cycles
5.4. Command line option reference
5.4.1. Miscellaneous options
5.4.2. Dump creation options
5.4.3. Activity options
5.4.4. Data collection options
5.4.5. Cost entity separation options
5.4.6. Cache simulation options

5.1. Overview

Callgrind is a Valgrind tool for profiling programs. The collected data consists of the number of instructions executed in a run, their relationship to source lines, and the call relationships among functions together with call counts. Optionally, a cache simulator (similar to Cachegrind) can produce further information about the memory access behavior of the application.

The profile data is written out to a file at program termination. For presentation of the data, and interactive control of the profiling, two command line tools are provided:

callgrind_annotate

This command reads in the profile data and prints a sorted list of functions, optionally with annotation.

For graphical visualization of the data, check out KCachegrind.

callgrind_control

This command enables you to interactively observe and control the status of currently running applications, without stopping the application. You can get statistics and the current stack trace, and request zeroing of counters and dumping of profile data.

To use Callgrind, you must specify --tool=callgrind on the Valgrind command line or use the supplied script callgrind.

Callgrind's cache simulation is based on the Cachegrind tool of the Valgrind package. Read Cachegrind's documentation first; this page describes the features supported in addition to Cachegrind's features.

5.2. Purpose

5.2.1. Profiling as part of Application Development

With application development, a common step is to improve runtime performance. To not waste time on optimizing functions which are rarely used, one needs to know in which parts of the program most of the time is spent.

This is done with a technique called profiling. The program is run under control of a profiling tool, which gives the time distribution of executed functions in the run. After examination of the program's profile, it should be clear if and where optimization is useful. Afterwards, one should verify any runtime changes by another profile run.

5.2.2. Profiling Tools

Most widely known is the GCC profiling tool GProf: one needs to compile an application with the compiler option -pg. Running the program generates a file gmon.out, which can be transformed into human-readable form with the command line tool gprof. A disadvantage here is the need to recompile everything, and also the need to statically link the executable.

Another profiling tool is Cachegrind, part of Valgrind. It uses the processor emulation of Valgrind to run the executable, and catches all memory accesses, which are used to drive a cache simulator. The program does not need to be recompiled, it can use shared libraries and plugins, and the profile measurement doesn't influence the memory access behaviour. The trace includes the number of instruction/data memory accesses and 1st/2nd level cache misses, and relates it to source lines and functions of the run program. A disadvantage is the slowdown involved in the processor emulation, around 50 times slower.

Cachegrind can only deliver a flat profile: no call relationships among the functions of an application are stored. Thus, inclusive costs, i.e. the cost of a function including the cost of all functions called from it, cannot be calculated. Callgrind extends Cachegrind by recording call relationships and the exact event counts spent while doing a call.
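To make the notion of inclusive cost concrete, here is a minimal sketch in C. The struct func type, the function inclusive_cost and all cost numbers are hypothetical and are not part of Callgrind. The sketch assumes a call graph without cycles in which every function has exactly one caller; in a real program the cost must be tracked per call, which is what requires the call information a flat profile lacks.

```c
#include <stddef.h>

#define MAX_CALLEES 4

/* One node of a hypothetical, acyclic call graph. */
struct func {
    const char *name;
    unsigned long self_cost;                 /* events spent in the function itself */
    const struct func *callees[MAX_CALLEES]; /* NULL-terminated unless full */
};

/* Inclusive cost = self cost + inclusive cost of every callee.
   Only valid under the single-caller assumption stated above. */
unsigned long inclusive_cost(const struct func *f)
{
    unsigned long total = f->self_cost;
    for (size_t i = 0; i < MAX_CALLEES && f->callees[i] != NULL; i++)
        total += inclusive_cost(f->callees[i]);
    return total;
}
```

For a root function with self cost 20 that calls two leaf functions with self costs 30 and 50, the inclusive cost of the root is 100, while its self cost stays at 20.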

Because Callgrind (and Cachegrind) is based on simulation, the slowdown due to processing the synthetic runtime events does not influence the results. See Usage for more details on the possibilities.

5.3. Usage

5.3.1. Basics

To start a profile run for a program, execute:

callgrind [callgrind options] your-program [program options]

While the simulation is running, you can observe execution with

callgrind_control -b

This will print out a current backtrace. To annotate the backtrace with event counts, run

callgrind_control -e -b

After program termination, a profile data file named callgrind.out.pid is generated with pid being the process ID of the execution of this profile run.

The data file contains information about the calls made in the program among the functions executed, together with events of type Instruction Read Accesses (Ir).

If you are additionally interested in measuring the cache behaviour of your program, use Callgrind with the option --simulate-cache=yes. This will further slow down the run approximately by a factor of 2.

If the program section you want to profile is somewhere in the middle of the run, it is beneficial to fast forward to this section without any profiling at all, and switch it on later. This is achieved by using --instr-atstart=no and interactively running callgrind_control -i on just before the interesting code section is executed.

If you want to be able to see assembler annotation, specify --dump-instr=yes. This will produce profile data at instruction granularity. Note that the resulting profile data can only be viewed with KCachegrind. For assembler annotation, it is also interesting to see more details of the control flow inside of functions, i.e. (conditional) jumps. This will be collected by further specifying --collect-jumps=yes.

5.3.2. Multiple profiling dumps from one program run

Often, you aren't interested in time characteristics of a full program run, but only of a small part of it (e.g. execution of one algorithm). If there are multiple algorithms or one algorithm running with different input data, it's even useful to get different profile information for multiple parts of one program run.

Profile data files have names of the form

callgrind.out.pid.part-threadID

where pid is the PID of the running program, part is a number incremented on each dump (".part" is skipped for the dump at program termination), and threadID is a thread identification ("-threadID" is only used if you request dumps of individual threads with --separate-threads=yes).

There are different ways to generate multiple profile dumps while a program is running under Callgrind's supervision. Nevertheless, all methods trigger the same action, which is "dump all profile information since the last dump or program start, and zero cost counters afterwards". To allow for zeroing cost counters without dumping, there is a second action "zero all cost counters now". The different methods are:

  • Dump on program termination. This method is the standard way and doesn't need any special action from your side.

  • Spontaneous, interactive dumping. Use

    callgrind_control -d [hint [PID/Name]]

    to request the dumping of profile information of the supervised application with PID or Name. hint is an arbitrary string you can optionally specify to later be able to distinguish profile dumps. The control program will not terminate before the dump is completely written. Note that the application must be actively running for detection of the dump command. So, for a GUI application, resize the window or for a server send a request.

    If you are using KCachegrind for browsing of profile information, you can use the toolbar button Force dump. This will request a dump and trigger a reload after the dump is written.

  • Periodic dumping after execution of a specified number of basic blocks. For this, use the command line option --dump-every-bb=count.

  • Dumping at enter/leave of all functions whose name starts with funcprefix. Use the options --dump-before=funcprefix and --dump-after=funcprefix. To zero cost counters before entering a function, use --zero-before=funcprefix. The prefix method for specifying function names was chosen to ease the use with C++: you don't have to specify full signatures.

    You can specify these options multiple times for different function prefixes.

  • Program controlled dumping. Put

    #include <valgrind/callgrind.h>

    into your source and add CALLGRIND_DUMP_STATS; where you want a dump to happen. Use CALLGRIND_ZERO_STATS; to only zero the cost counters.

    In Valgrind terminology, this method is called "Client requests". The given macros generate a special instruction pattern with no effect at all (i.e. a NOP). When run under Valgrind, the CPU simulation engine detects the special instruction pattern and triggers special actions like the ones described above.

If you are running a multi-threaded application and specify the command line option --separate-threads=yes, every thread will be profiled on its own and will create its own profile dump. Thus, the last two methods will only generate one dump of the currently running thread. With the other methods, you will get multiple dumps (one for each thread) on a dump request.
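As a minimal sketch of program-controlled dumping, the fragment below requests one dump per phase of a made-up workload. The function names sum_squares and run_phases are hypothetical. The preprocessor fallback at the top is an assumption added so that the file also builds on machines without the Valgrind headers; it is not part of Callgrind.

```c
/* Use the real client-request macros when the Valgrind headers are
   installed; otherwise fall back to no-ops so the file still builds
   (assumption for illustration only). */
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_DUMP_STATS
#  define CALLGRIND_DUMP_STATS ((void)0)
#  define CALLGRIND_ZERO_STATS ((void)0)
#endif

/* Hypothetical workload: sum of i*i for 0 <= i < n. */
static unsigned long sum_squares(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 0; i < n; i++)
        total += i * i;
    return total;
}

/* Runs two phases and requests one profile dump after each. */
unsigned long run_phases(void)
{
    CALLGRIND_ZERO_STATS;   /* discard costs accumulated before phase 1 */
    unsigned long first = sum_squares(10);
    CALLGRIND_DUMP_STATS;   /* dump phase 1; counters are zeroed afterwards */
    unsigned long second = sum_squares(20);
    CALLGRIND_DUMP_STATS;   /* dump phase 2 */
    return first + second;
}
```

When a program calling run_phases() runs under Callgrind, each CALLGRIND_DUMP_STATS request should produce its own profile data file in addition to the dump at termination. Run natively, the macros have no effect.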

5.3.3. Limiting the range of collected events

For aggregating events (function enter/leave, instruction execution, memory access) into event numbers, first, the events must be recognizable by Callgrind, and second, the collection state must be switched on.

Event collection is only possible if instrumentation for program code is switched on. This is the default, but for faster execution (identical to valgrind --tool=none), it can be switched off until the program reaches a state in which you want to start collecting profiling data. Callgrind can start without instrumentation by specifying option --instr-atstart=no. Instrumentation can be switched on interactively with

callgrind_control -i on

and off by specifying "off" instead of "on". Furthermore, the instrumentation state can be changed programmatically with the macros CALLGRIND_START_INSTRUMENTATION; and CALLGRIND_STOP_INSTRUMENTATION;.
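The two instrumentation macros might be used as in the following sketch, which assumes the program is started with --instr-atstart=no. The functions setup, hot_loop and profiled_run are hypothetical, and the no-op fallback definitions are an assumption that only serves to keep the file buildable without the Valgrind headers.

```c
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_START_INSTRUMENTATION
#  define CALLGRIND_START_INSTRUMENTATION ((void)0)
#  define CALLGRIND_STOP_INSTRUMENTATION  ((void)0)
#endif

/* Hypothetical start-up work that is not of interest for profiling. */
static long setup(void)
{
    return 7;
}

/* Hypothetical code section that should be profiled. */
static long hot_loop(long seed)
{
    for (long i = 0; i < 100; i++)
        seed += i;
    return seed;
}

long profiled_run(void)
{
    long seed = setup();               /* uninstrumented with --instr-atstart=no */
    CALLGRIND_START_INSTRUMENTATION;   /* events become visible from here on */
    long result = hot_loop(seed);
    CALLGRIND_STOP_INSTRUMENTATION;    /* back to fast, uninstrumented execution */
    return result;
}
```

With this placement only hot_loop runs instrumented, so the call graph would be expected to contain hot_loop but not setup.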

In addition to enabling instrumentation, you must also enable event collection for the parts of your program you are interested in. By default, event collection is enabled everywhere. You can limit collection to specific function(s) by using --toggle-collect=funcprefix. This will toggle the collection state on entering and leaving the specified functions. When this option is in effect, the default collection state at program start is "off". Only events happening while running inside of functions starting with funcprefix will be collected. Recursive calls of functions with funcprefix do not trigger any action.

It is important to note that with instrumentation switched off, the cache simulator cannot see any memory access events, and thus, any simulated cache state will be frozen and wrong without instrumentation. Therefore, to get useful cache events (hits/misses) after switching on instrumentation, the cache first must warm up, probably leading to many cold misses which would not have happened in reality. If you do not want to see these, start event collection a few million instructions after you have switched on instrumentation.

5.3.4. Avoiding cycles

A cycle is a group of functions in which, for any two of them, there is a call chain from one to the other. For example, with A calling B, B calling C, and C calling A, the three functions A, B and C form one cycle.
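In C, the cycle from the example could look like the sketch below. The lower-case functions a, b and c stand in for A, B and C; what they compute is made up for illustration. Compile without optimization so that the calls are not inlined away.

```c
static int b(int n);
static int c(int n);

/* a calls b, b calls c, and c calls a again: one cycle a > b > c > a.
   The return value counts the calls made around the cycle (3 per round). */
int a(int n)
{
    return n == 0 ? 0 : 1 + b(n);
}

static int b(int n)
{
    return 1 + c(n);
}

static int c(int n)
{
    return 1 + a(n - 1);
}
```

A call such as a(2) goes around the cycle twice, so the call from a to b appears twice in one call chain.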

If a call chain goes around a cycle multiple times, profiling cannot distinguish event counts coming from the first round from those of the second. Thus, it makes no sense to attach any inclusive cost to a call among functions inside of one cycle. If "A > B" appears multiple times in a call chain, you have no way to partition the one big sum of all appearances of "A > B". Thus, for profile data presentation, all functions of a cycle are seen as one big virtual function.

Unfortunately, if you have an application using some callback mechanism (like any GUI program), or even with normal polymorphism (as in OO languages like C++), it's quite possible to get large cycles. As it is often impossible to say anything about performance behaviour inside of cycles, it is useful to introduce some mechanisms to avoid cycles in call graphs. This is done by treating the same function in different ways, depending on the current execution context, either by giving them different names, or by ignoring calls to functions.

There is an option to ignore calls to a function with --fn-skip=funcprefix. E.g., you usually do not want to see the trampoline functions in the PLT sections for calls to functions in shared libraries. You can see the difference if you profile with --skip-plt=no. If a call is ignored, cost events happening will be attached to the enclosing function.

If you have a recursive function, you can distinguish the first 10 recursion levels by specifying --fn-recursion10=funcprefix. You can do the same for all functions with --fn-recursion=10, but this will give you much bigger profile data files. In the profile data, you will see the recursion levels of "func" as different functions with the names "func", "func'2", "func'3" and so on.
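A directly recursive function of the kind meant here is sketched below; the name sum_to is hypothetical. Following the naming scheme described above, the nested activations would be expected to show up as "sum_to", "sum_to'2", "sum_to'3" and so on when recursion levels are separated. An optimizing compiler may turn this recursion into a loop, so the sketch assumes a build without optimization.

```c
/* Sums 1..n recursively; each nested call is one recursion level. */
unsigned long sum_to(unsigned long n)
{
    if (n == 0)
        return 0;
    return n + sum_to(n - 1);
}
```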

If you have call chains "A > B > C" and "A > C > B" in your program, you usually get a "false" cycle "B <> C". Use --fn-caller2=B --fn-caller2=C, and functions "B" and "C" will be treated as different functions depending on the direct caller. Using the apostrophe for appending this "context" to the function name, you get "A > B'A > C'B" and "A > C'A > B'C", and there will be no cycle. Use --fn-caller=2 to get a 2-caller dependency for all functions. Note that doing this will increase the size of profile data files.
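The following sketch shows how such a "false" cycle can arise. The lower-case functions a, b and c stand in for A, B and C, and the called_from_a parameter is a made-up device that keeps the two chains apart. At run time b and c never call each other back and forth, but a call graph that ignores the caller context contains both "b > c" and "c > b". Compile without optimization so that the calls are not inlined.

```c
static int b(int called_from_a);
static int c(int called_from_a);

/* a calls both b and c directly. */
int a(void)
{
    return b(1) + c(1);
}

/* b calls c only when b was called directly from a: chain a > b > c. */
static int b(int called_from_a)
{
    return called_from_a ? 1 + c(0) : 1;
}

/* c calls b only when c was called directly from a: chain a > c > b. */
static int c(int called_from_a)
{
    return called_from_a ? 1 + b(0) : 1;
}
```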

5.4. Command line option reference

In the following, options are grouped into classes, in the same order as in the output of callgrind --help.

5.4.1. Miscellaneous options

--help

Show summary of options. This is a short version of this manual section.

--version

Show version of callgrind.

5.4.2. Dump creation options

These options influence the name and format of the profile data files.

--base=<prefix> [default: callgrind.out]

Specify the base name for the dump file names. To distinguish different profile runs of the same application, .<pid> is appended to the base dump file name with <pid> being the process ID of the profile run (with multiple dumps happening, the file name is modified further; see below).

This option is especially useful if your application changes its working directory. Usually, the dump file is generated in the current working directory of the application at program termination. By giving an absolute path with the base specification, you can force a fixed directory for the dump files.

--dump-instr=<no|yes> [default: no]

This specifies that event counting should be performed at per-instruction granularity. This allows for assembler code annotation, but currently the results can only be shown with KCachegrind.

--dump-line=<no|yes> [default: yes]

This specifies that event counting should be performed at source line granularity. This allows source annotation for sources which are compiled with debug information ("-g").

--compress-strings=<no|yes> [default: yes]

This option influences the output format of the profile data. It specifies whether strings (file and function names) should be identified by numbers. This shrinks the file size, but makes the file more difficult for humans to read (which is not recommended either way).

However, this currently has to be switched off if the files are to be read by callgrind_annotate!

--compress-pos=<no|yes> [default: yes]

This option influences the output format of the profile data. It specifies whether numerical positions are always specified as absolute values or are allowed to be relative to previous numbers. This shrinks the file size.

However, this currently has to be switched off if the files are to be read by callgrind_annotate!

--combine-dumps=<no|yes> [default: no]

When multiple profile data parts are to be generated, these parts are appended to the same output file if this option is set to "yes". Not recommended.

5.4.3. Activity options

These options specify when actions relating to event counts are to be executed. For interactive control use callgrind_control.

--dump-every-bb=<count> [default: 0, never]

Dump profile data every <count> basic blocks. Whether a dump is needed is only checked when Valgrind's internal scheduler is run. Therefore, the minimum useful setting is about 100000. The count is a 64-bit value to make long dump periods possible.

--dump-before=<prefix>

Dump when entering a function starting with <prefix>.

--zero-before=<prefix>

Zero all costs when entering a function starting with <prefix>.

--dump-after=<prefix>

Dump when leaving a function starting with <prefix>.

5.4.4. Data collection options

These options specify when events are to be aggregated into event counts. Also see Limiting the range of collected events.

--instr-atstart=<yes|no> [default: yes]

Specify if you want Callgrind to start simulation and profiling from the beginning of the program. When set to no, Callgrind will not be able to collect any information, including calls, but it will have at most a slowdown of around 4, which is the minimum Valgrind overhead. Instrumentation can be interactively switched on via callgrind_control -i on.

Note that the resulting call graph will most probably not contain main, but will contain all the functions executed after instrumentation was switched on. Instrumentation can also be switched on and off programmatically. See the Callgrind include file <callgrind.h> for the macros you have to use in your source code.

For cache simulation, results will be less accurate when switching on instrumentation later in the program run, as the simulator starts with an empty cache at that moment. Switch on event collection later to cope with this error.

--collect-atstart=<yes|no> [default: yes]

Specify whether event collection is switched on at beginning of the profile run.

To only look at parts of your program, you have two possibilities:

  1. Zero event counters before entering the program part you want to profile, and dump the event counters to a file after leaving that program part.

  2. Switch on/off collection state as needed to only see event counters happening while inside of the program part you want to profile.

The second option can be used if the program part you want to profile is called many times. Option 1, i.e. creating a lot of dumps, is not practical here.

Collection state can be toggled at entry and exit of a given function with the option --toggle-collect. If you use this option, collection state should be switched off at the beginning. Note that the specification of --toggle-collect implicitly sets --collect-atstart=no.

Collection state can also be toggled by using a Valgrind Client Request in your application. For this, include valgrind/callgrind.h and place the macro CALLGRIND_TOGGLE_COLLECT at the needed positions. This only has an effect when the program runs under supervision of the Callgrind tool.
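A minimal sketch of toggling the collection state from within the program follows. It assumes the run was started with --collect-atstart=no, so that the first toggle switches collection on. The function checksum is hypothetical, and the no-op fallback definition is an assumption that keeps the file buildable without the Valgrind headers.

```c
#include <stddef.h>

#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif
#ifndef CALLGRIND_TOGGLE_COLLECT
#  define CALLGRIND_TOGGLE_COLLECT ((void)0)
#endif

/* Hypothetical function whose loop is the only part to be measured. */
long checksum(const unsigned char *buf, size_t len)
{
    long sum = 0;

    CALLGRIND_TOGGLE_COLLECT;   /* off -> on, given --collect-atstart=no */
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    CALLGRIND_TOGGLE_COLLECT;   /* on -> off again */

    return sum;
}
```

Because the macro toggles instead of setting a fixed state, the two requests must stay paired; an early return between them would leave collection switched on.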

--toggle-collect=<prefix>

Toggle collection on entry/exit of a function whose name starts with <prefix>.

--collect-jumps=<no|yes> [default: no]

This specifies whether information for (conditional) jumps should be collected. As above, callgrind_annotate currently is not able to show you the data. You have to use KCachegrind to get jump arrows in the annotated code.

5.4.5. Cost entity separation options

These options specify how event counts should be attributed to execution contexts. More specifically, they specify e.g. if the recursion level or the call chain leading to a function should be accounted for, and whether the thread ID should be remembered. Also see Avoiding cycles.

--separate-threads=<no|yes> [default: no]

This option specifies whether profile data should be generated separately for every thread. If yes, the file names get "-threadID" appended.

--fn-recursion=<level> [default: 2]

Separate function recursions, maximal <level>. See Avoiding cycles.

--fn-caller=<callers> [default: 0]

Separate contexts by maximal <callers> functions in the call chain. See Avoiding cycles.

--skip-plt=<no|yes> [default: yes]

Ignore calls to/from PLT sections.

--fn-skip=<function>

Ignore calls to/from a given function. E.g. if you have a call chain A > B > C, and you specify function B to be ignored, you will only see A > C.

This is very convenient for skipping functions that handle callback behaviour. E.g. for the SIGNAL/SLOT mechanism in Qt, you only want to see the function emitting a signal calling the slots connected to that signal. First, determine the real call chain to see which functions need to be skipped, then use this option.

--fn-group<number>=<function>

Put a function into a separate group. This influences the context name for cycle avoidance. All functions inside of such a group are treated as being the same for context name building, which resembles the call chain leading to a context. By specifying function groups with this option, you can shorten the context name, as functions in the same group will not appear in sequence in the name.

--fn-recursion<number>=<function>

Separate <number> recursions for <function>. See Avoiding cycles.

--fn-caller<number>=<function>

Separate <number> callers for <function>. See Avoiding cycles.

5.4.6. Cache simulation options

--simulate-cache=<yes|no> [default: no]

Specify if you want to do full cache simulation. By default, only instruction read accesses will be profiled.