Personal tools
You are here: Home cm Wiki Profiling Code
Views

History for Profiling Code

changed:
-
<center>
<h1>Profiling Code</h1>
</center>
<p>
To profile a program you can use one of the profiling utilities provided 
either as open source by GNU, or by the individual computer vendors. The 
choice of profiler depends on what platform you are on;
</p><p>

</p><center>
<table border="1" width="30%">
 <tbody><tr>
  <td align="left">
   <bold>IRIX</bold> (SGI)
  </td>
  <td align="left">
   <a href="#speedshop">speedshop</a>
  </td>
 </tr>
 <tr>
  <td align="left">
   <bold>AIX</bold> (IBM)
  </td>
  <td align="left">
   <a href="#gprof">gprof</a><br>
   <a href="#tprof">tprof</a>
  </td>
 </tr>
 <tr>
  <td align="left">
   <bold>Linux</bold>
  </td>
  <td align="left">
   <a href="#gprof">gprof</a><br>
  </td>
 </tr>
</tbody></table>
</center>

<p>
On the IBM (<em>hpc1</em> running AIX), <a href="#gprof"><em>gprof</em></a> 
is available for profiling single cpu programs. However, if you need to profile
a parallel program then you can't use <em>gprof</em> since the separate threads
will overwrite the <em>gmon.out</em> data file. To profile a parallel program 
you will have to use <a href="#tprof"><em>tprof</em></a> which has two main
drawbacks; it only works for 32 bit executables, and it has to be run as root.
</p><p>
On SGI's (such as O<sup>2</sup>s, Octanes, and the <em>hpc2</em> Onyx),
only the vendor supplied <a href="#speedshop"><em>speedshop</em></a> profiler
is available.
</p><p>
On Linux platforms only the <a href="#gprof"><em>gprof</em></a> profiler is
available, so it is probably best to use that.
</p><p>


</p><h2><a name="gprof">gprof</a></h2>
<p>
<em>gprof</em> is the GNU profiler which is freely available on most platforms.
The code is linked or compiled and linked using the <em>-pg</em> flag. The 
executable is then run, generating a file called <em>gmon.out</em>. 
<em>gprof</em> is then run on this file, giving statistics about the run of 
the program being tested.
</p><p>
<em>gprof</em> works by sampling the code at regular intervals (typically every
0.01 seconds) and calculating the time spent in each routine from this data. 
A fuller description of the run can be made by recompiling all the objects 
using the <em>-pg</em> option, in which case a calltree is also recorded, 
giving a record of what routines were called and in what order.
</p><p>

</p><h3>The Flat Profile</h3>
<p>
<em>gprof</em> can generate two main types of listing. The first is the 
<em>flat profile,</em> which can be generated with code that is linked using
the <em>-pg</em> option
</p><p>
</p><pre>	f77 -c test_f.f<br>	cc -c test_c.c<br>	f77 -pg -o test_obj test_f.o test_c.o<br></pre>
<p>
The code is run, generating a file called <em>gmon.out</em>, and this is 
processed using <em>gprof,</em>
</p><p>
</p>
<pre>	./test_obj<br>	gprof -p -b ./test_obj<br></pre>
<br><p><br>The flat profile shows the amount of time spent in each routine,<br></p><p><br></p><pre>	Flat profile:<br>	<br>	Each sample counts as 0.01 seconds.<br>	  %   cumulative   self              self     total           <br>	 time   seconds   seconds    calls  Ts/call  Ts/call  name    <br>	 50.00      0.75     0.75                             matrix_mult_<br>	 11.33      0.92     0.17                             pdaxpy_<br>	  8.67      1.05     0.13                             pddot_<br>	  8.00      1.17     0.12                             set_fv2d_equations_<br>	  6.67      1.27     0.10                             calc_residual_<br>	  4.00      1.33     0.06                             pdcopy_<br>	  4.00      1.39     0.06                             precon_solve_<br>	  3.33      1.44     0.05                             calc_mat_norm_<br>	  2.00      1.47     0.03                             solv_fix_bounds_<br>	  0.67      1.48     0.01                             conjgrad_sub_<br>	  0.67      1.49     0.01                             dlamc4_<br>	  0.67      1.50     0.01                             z_wnew<br></pre><br><p><br>Note that no calltree data is available if the code is only linked<br>with the <em>-pg</em> option. Also, only those routines which were running when<br>the object was sampled are shown. Thus you cannot rule out that a routine<br>was run, only that it was not run at any of the times that the program was <br>sampled.<br></p><p><br><br></p><h3>The Calltree Profile</h3><br><p><br>To get a calltree, and to show all routines that were run, you must both <br>compile and link with the <em>-pg</em> option.<br></p><p><br></p><pre>	f77 -pg -c test_f.f<br>	cc -pg -c test_c.c<br>	f77 -pg -o test_obj test_f.o test_c.o<br></pre><br><p><br>As before, the code is run, and the sample data is processed using <br><em>gprof</em>
</p><p><br></p><pre>	./test_obj<br>	gprof -p -b test_obj gmon.out<br></pre><br><p><br>The flat profile now shows all routines called.<br></p><p><br></p><pre>	Flat profile:<br>	<br>	Each sample counts as 0.01 seconds.<br>	  %   cumulative   self              self     total           <br>	 time   seconds   seconds    calls  ms/call  ms/call  name    <br>	 50.99      0.77     0.77       22    35.00    35.00  matrix_mult_<br>	 12.58      0.96     0.19                             pdaxpy_<br>	  8.61      1.09     0.13        1   130.00   130.00  set_fv2d_equations_<br>	  5.96      1.18     0.09        2    45.00    45.00  calc_residual_<br>	  5.30      1.26     0.08                             pddot_<br>	  3.97      1.32     0.06        2    30.00    30.00  calc_mat_norm_<br>	  3.97      1.38     0.06                             pdcopy_<br>	  1.99      1.41     0.03       22     1.36     1.36  precon_solve_<br>	  1.99      1.44     0.03        1    30.00    30.00  solv_fix_bounds_<br>	  1.99      1.47     0.03                             pdnrm2_<br>	  0.66      1.48     0.01        1    10.00   855.00  conjgrad_sub_<br>	  0.66      1.49     0.01                             dlamc4_<br>	  0.66      1.50     0.01                             dnrm2_<br>	  0.66      1.51     0.01                             pdload_<br>	  0.00      1.51     0.00       60     0.00     0.00  enters_<br>	  0.00      1.51     0.00       59     0.00     0.00  exits_<br>	  0.00      1.51     0.00       49     0.00     0.00  len_trim_<br><br>	... etc<br></pre><br><p><br>Note that the last three routines listed show zero % cpu. They were called<br>during the running of the program, but were never being run when the<br>profiler sampled the code.<br></p><p><br>To get a call graph try running <em>gprof</em> without the <em>-p</em> option,<br>or running with the <em>-q</em> option. The call graph generated shows the time<br>spent in each routine, the name of the parent routine (the routine that called <br>it), and the names and times spent in the child routines (those that the <br>routine called). <br></p><p><br></p><pre>	                        Call graph<br>  <br>  <br>	granularity: each sample hit covers 4 byte(s) for 0.66% of 1.51 seconds<br>  <br>	index % time    self  children    called     name<br>	                0.00    1.12       1/1           main [2]<br>	[1]     74.2    0.00    1.12       1         MAIN__ [1]<br>	                0.00    0.89       1/1           iter_solve_ [3]<br>	                0.13    0.00       1/1           set_fv2d_equations_ [8]<br>	                0.00    0.07       1/1           check_soln_ [11]<br>	                0.00    0.03       1/1           iter_factor_ [15]<br>	                0.00    0.00       5/10          allocate_memory_ [26]<br>	                0.00    0.00       3/3           timer_start_ [35]<br>	                0.00    0.00       3/3           timer_stop_ [36]<br>	                0.00    0.00       3/3           timer_read_busy_ [33]<br>	                0.00    0.00       3/3           timer_read_wall_ [34]<br>	                0.00    0.00       3/8           free_memory_ [29]<br>	                0.00    0.00       1/1           iter_alloc_ [38]<br>	-----------------------------------------------<br>	                                                 <spontaneous><br>	[2]     74.2    0.00    1.12                 main [2]<br>	                0.00    1.12       1/1           MAIN__ [1]<br>	-----------------------------------------------<br>	                0.00    0.89       1/1           MAIN__ [1]<br>	[3]     58.6    0.00    0.89       1         iter_solve_ [3]<br>	                0.00    0.85       1/1           conjgrad_ [4]<br>	                0.03    0.00       1/1           solv_fix_bounds_ [16]<br>	                0.00    0.00       1/60          enters_ [21]<br>	                0.00    0.00       1/59          exits_ [22]<br>	-----------------------------------------------<br>	                0.00    0.85       1/1           iter_solve_ [3]<br>	[4]     56.6    0.00    0.85       1         conjgrad_ [4]<br>	                0.01    0.84       1/1           conjgrad_sub_ [5]<br>	                0.00    0.00       5/10          allocate_memory_ [26]<br>	                0.00    0.00       5/8           free_memory_ [29]<br>	                0.00    0.00       1/60          enters_ [21]<br>	                0.00    0.00       1/59          exits_ [22]<br>	<br>	...etc<br></spontaneous></pre><br><p><br>To limit the profile to one routine and it's children use the <em>-f</em> <br>option,<br></p><p><br></p><pre>	gprof -f conjgrad_ test_obj gmon.out<br></pre><br><p><br>which starts the calltree at <em>conjgrad_,</em> and limits data to this <br>routines children.<br></p><p><br><br><br></p><h2><a name="speedshop">speedshop</a></h2><br><p><br><em>speedshop</em> is the profiler available under IRIX. This a brief <br>introduction to the system; for a fuller account the user is directed to the<br><a href="http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0650&amp;db=bks&amp;cmd=toc&amp;pth=/SGI_Developer/SShop_UG">SGI online manual</a>, also available<br>in <a href="http://techpubs.sgi.com/library/manuals/3000/007-3311-008/pdf/007-3311-008.pdf">PDF</a> format.<br></p><p><br><br>Unlike gprof, no recompilation or relinking is necessary. The user runs the<br>executable using the <em>ssrun</em> command, generating some data files<br>which are then processed using the <em>prof</em> command.<br></p><p><br><br>To give an example, the program is compiled normally,<br></p><p><br></p><pre>	f77 -c test.f<br>	f77 -o test_obj test.o<br></pre><br><p><br>and then run using the <em>ssrun</em> command,<br></p><p><br></p><pre>	ssrun -usertime ./test_obj<br></pre><br><p><br><br>This generates a data file called something like <br><em>test_obj.usertime.m7663628</em>, where <em>test_obj</em> is the name of the <br>executable, <em>usertime</em> is the name of the experiment type, and <br><em>m7663628</em> gives<br>the process id of the job, with the <em>m</em> prefix showing that the<br>file is from the master process (the process started by ssrun). Other <br>experiments include measuring <em>totaltime</em>, the physical or wall clock <br>time of the process, and <em>fpe</em>, to record floating point exceptions.<br></p><p><br><br>A profile of the run is then generated with <em>prof</em>,<br></p><p><br></p><pre>	prof ./test_obj test_obj.usertime.m7663628<br></pre><br><p><br>which generates a profile like<br></p><p><br></p><pre>-------------------------------------------------------------------------<br>SpeedShop profile listing generated Tue Nov  5 10:15:42 2002<br>   prof test_obj test_obj.usertime.m7663628<br>                test_obj (n32): Target program<br>                      usertime: Experiment name<br>                         ut:cu: Marching orders<br>               R14000 / R14010: CPU / FPU<br>                            16: Number of CPUs<br>                           500: Clock frequency (MHz.)<br>  Experiment notes--<br>          From file test_obj.usertime.m7663628:<br>        Caliper point 0 at target begin, PID 7663628<br>                        test_obj<br>        Caliper point 1 at exit(0)<br>-------------------------------------------------------------------------<br>Summary of statistical callstack sampling data (usertime)--<br>                            98: Total Samples<br>                             0: Samples with incomplete traceback<br>                         2.940: Accumulated Time (secs.)<br>                          30.0: Sample interval (msecs.)<br>-------------------------------------------------------------------------<br>Function list, in descending order by exclusive time<br>-------------------------------------------------------------------------<br> [index]  excl.secs excl.%   cum.%  incl.secs incl.%    samples  procedure  (dso: file, line)<br><br>     [6]      1.560  53.1%   53.1%      1.560  53.1%         52  MATRIX_MULT (test_obj: matrix.f, 986)<br>     [7]      0.300  10.2%   63.3%      0.300  10.2%         10  PDAXPY (test_obj: pdaxpy.f, 2)<br>     [8]      0.240   8.2%   71.4%      0.240   8.2%          8  SET_FV2D_EQUATIONS (test_obj: fd.f, 325)<br>     [9]      0.210   7.1%   78.6%      0.210   7.1%          7  CALC_RESIDUAL (test_obj: matrix.f, 919)<br>    [11]      0.150   5.1%   83.7%      0.150   5.1%          5  CALC_MAT_NORM (test_obj: matrix.f, 698)<br>    [12]      0.150   5.1%   88.8%      0.150   5.1%          5  PDCOPY (test_obj: pdcopy.f, 2)<br>    [14]      0.090   3.1%   91.8%      0.090   3.1%          3  SOLV_FIX_BOUNDS (test_obj: matrix.f, 482)<br>    [10]      0.090   3.1%   94.9%      0.180   6.1%          6  PRECON_SOLVE (test_obj: entry.F, 223)<br>    [15]      0.090   3.1%   98.0%      0.090   3.1%          3  PDDOT (test_obj: pddot.f, 2)<br>     [4]      0.060   2.0%  100.0%      2.370  80.6%         79  CONJGRAD_SUB (test_obj: cg.F, 55)<br>     [1]      0.000   0.0%  100.0%      2.940 100.0%         98  main (libftn.so: main.c, 76)<br>     [2]      0.000   0.0%  100.0%      2.940 100.0%         98  TEST_SOLVERS (test_obj: main.F, 2)<br>    [13]      0.000   0.0%  100.0%      0.150   5.1%          5  CHECK_SOLN (test_obj: matrix.f, 2)<br>    [16]      0.000   0.0%  100.0%      0.090   3.1%          3  ITER_FACTOR (test_obj: entry.F, 437)<br>     [3]      0.000   0.0%  100.0%      2.460  83.7%         82  ITER_SOLVE (test_obj: entry.F, 512)<br>     [5]      0.000   0.0%  100.0%      2.370  80.6%         79  CONJGRAD (test_obj: cg.F, 2)<br></pre><br><p><br><br>To get a calltree the <em>-b</em> argument can be passed to <em>prof</em>
</p><p><br></p><pre>	prof -b ./test_obj test_obj.usertime.m7663628<br></pre><br><p><br>which generates a profile like<br></p><p><br></p><pre>-------------------------------------------------------------------------<br>Butterfly function list, in descending order by inclusive time<br>-------------------------------------------------------------------------<br>         attrib.% attrib.time                         incl.time  caller (callsite) [index]<br> [index]   incl.%   incl.time    self%   self-time       procedure [index]<br>                              attrib.% attrib.time    incl.time  callee (callsite) [index]<br>-------------------------------------------------------------------------<br>     [1]   100.0%       2.940     0.0%       0.000         main [1]<br>                                100.0%       2.940        2.940  TEST_SOLVERS (@0x0adc9c20; libftn.so: main.c, 97) [2]<br>-------------------------------------------------------------------------<br>           100.0%       2.940                             2.940  main (@0x0adc9c20; libftn.so: main.c, 97) [1]<br>     [2]   100.0%       2.940     0.0%       0.000         TEST_SOLVERS [2]<br>                                 83.7%       2.460        2.460  ITER_SOLVE (@0x1000e040; test_obj: main.F, 1105) [3]<br>                                  8.2%       0.240        0.240  SET_FV2D_EQUATIONS (@0x1000d9e8; test_obj: main.F, 670) [8]<br>                                  5.1%       0.150        0.150  CHECK_SOLN (@0x1000e0e4; test_obj: main.F, 1123) [13]<br>                                  3.1%       0.090        0.090  ITER_FACTOR (@0x1000df38; test_obj: main.F, 1092) [16]<br>-------------------------------------------------------------------------<br>            83.7%       2.460                             2.940  TEST_SOLVERS (@0x1000e040; test_obj: main.F, 1105) [2]<br>     [3]    83.7%       2.460     0.0%       0.000         ITER_SOLVE [3]<br>                                 80.6%       2.370        2.370  CONJGRAD (@0x10024c34; test_obj: entry.F, 573) [5]<br>                                  3.1%       0.090        0.090  SOLV_FIX_BOUNDS (@0x10024a28; test_obj: entry.F, 540) [14]<br>-------------------------------------------------------------------------<br>            80.6%       2.370                             2.370  CONJGRAD (@0x10047cdc; test_obj: cg.F, 35) [5]<br>     [4]    80.6%       2.370     2.0%       0.060         CONJGRAD_SUB [4]<br>                                 53.1%       1.560        1.560  MATRIX_MULT (@0x10048438; test_obj: cg.F, 161) [6]<br>                                  6.1%       0.180        0.180  PRECON_SOLVE (@0x10048374; test_obj: cg.F, 141) [10]<br>                                  5.1%       0.150        0.300  PDAXPY (@0x10048504; test_obj: cg.F, 179) [7]<br>                                  4.1%       0.120        0.210  CALC_RESIDUAL (@0x1004828c; test_obj: cg.F, 135) [9]<br>                                  3.1%       0.090        0.300  PDAXPY (@0x100483ec; test_obj: cg.F, 156) [7]<br>                                  2.0%       0.060        0.150  PDCOPY (@0x10048408; test_obj: cg.F, 157) [12]<br>                                  2.0%       0.060        0.300  PDAXPY (@0x100484a4; test_obj: cg.F, 173) [7]<br>                                  2.0%       0.060        0.090  PDDOT (@0x10048460; test_obj: cg.F, 163) [15]<br>                                  1.0%       0.030        0.090  PDDOT (@0x100483a0; test_obj: cg.F, 146) [15]<br>-------------------------------------------------------------------------<br>            80.6%       2.370                             2.460  ITER_SOLVE (@0x10024c34; test_obj: entry.F, 573) [3]<br>     [5]    80.6%       2.370     0.0%       0.000         CONJGRAD [5]<br>                                 80.6%       2.370        2.370  CONJGRAD_SUB (@0x10047cdc; test_obj: cg.F, 35) [4]<br>-------------------------------------------------------------------------<br>... etc<br></pre><br><p><br><br>The <em>-usage</em> flag gives a summary of the resource usage of the program,<br></p><p><br></p><pre>	prof -usage ./test_obj test_obj.usertime.m7663628<br></pre><br><p><br>which generates a profile like<br></p><p><br></p><pre>-------------------------------------------------------------------------<br>Resource usage data<br>-------------------------------------------------------------------------<br>                           Real-time:     3.032084 (max.)<br><br>             Process Statistics (sum)       System-wide Statistics (max.)<br>             ========================       =============================<br>              user-time:     2.739388       total-user-cpu:    41.400000 <br>               sys-time:     0.188260        total-sys-cpu:     7.020000 <br>              wait-time:     0.104435       total-idle-cpu:     0.090000 <br><br>             accounting state timers<br>             -----------------------<br>               user-run:     2.739388<br>             system-run:     0.188270<br>          interrupt-run:     0.005423<br>       blocked-I/O-wait:     0.020920<br>            memory-wait:     0.000000<br>            select-wait:     0.000000<br>       job-control-wait:     0.000000<br>         run-queue-wait:     0.077886<br>             sleep-wait:     0.000320<br>    stream-monitor-wait:     0.000091<br>           raw-I/O-wait:     0.000000<br>                          -----------<br>                    SUM:     3.032300<br><br>                 summary statistics<br>                 ------------------<br>               bytes-read:          0<br>            bytes-written:        495<br>              page-faults:          2<br>             minor-faults:         45<br>    vol.-context-switches:         17<br>  invol.-context-switches:         25   total-context-switches:       2682<br>                  signals:         98         total-interrupts:       9581<br>            user-syscalls:        484           total-syscalls:     168372<br>            read-syscalls:          0      total-read-syscalls:       1006<br>           write-syscalls:         20     total-write-syscalls:        221<br>            poll-syscalls:          0<br>           ioctl-syscalls:          0<br>       process-size(max.):       6596<br>  resident-set-size(max.):       3212<br><br></pre><br><p><br><br>When profiling parallel programs run a <em>totaltime</em> experiment, with<br>the environment variable <em>MPC_GANG</em> set to <em>OFF</em>. A speedshop<br>file will be generated for each thread, the master thread being something <br>like <em>test_obj.usertime.m1111</em>, with the child threads having their<br>process ID's prefixed by a <em>p</em> rather than a <em>m</em>, <br><em>test_obj.usertime.p2222</em> for example. <br></p><p><br><br>Running <em>prof</em> on these output files with the <em>-overhead</em><br>flag gives the parallel overhead for the run,<br></p><p><br></p><pre>-------------------------------------------------------------------------<br>OpenMP Report:<br>-------------------------------------------------------------------------<br>                Parallelization Overhead: 00.000% excl.   00.000% incl.<br>                          Load Imbalance: 00.000% excl.   00.000% incl.<br>                Insufficient Parallelism: 16.596% excl.   16.596% incl.<br>                            Barrier Loss: 26.383% excl.   26.383% incl.<br>                    Synchronization Loss: 00.000% excl.   00.000% incl.<br>           Other Model-specific Overhead: 00.000% excl.   00.000% incl.<br></pre><br><p><br>Running <em>prof</em> on the file generated by the <em>master</em> thread <br>gives the sum of serial and parallel times, with the barrier losses for that <br>thread, whilst running <em>prof</em> on the files generated by the child threads<br>shows the insufficient parallelism, the time spent idle by the child threads.<br></p><p><br>If you wish to have a profile of where the program is spending its time, then<br>the profile of the master thread is most useful. An account of which routines<br>were running in parallel is found by examining the profiles of the parallel threads.<br></p><p><br><br><br></p><h2><a name="tprof">tprof</a></h2><br><p><br><em>tprof</em> is the threadsafe profiler provided by IBM for their<br>AIX platforms (such as hpc1). Unlike the AIX implementation of <em>gprof</em>,<br>it can be used to profile parallel programs. However it must be run as root,<br>and is only supported for 32 bit executables.<br></p><p><br>Unlike <em>gprof</em> the executable need not be relinked, but instead is<br>run via the <em>tprof</em> command.<br></p><p><br>To demonstrate, the program is compiled and linked as normal,<br></p><p><br></p><pre>	xlf -c test.f<br>	xlf -o test_obj test.o<br></pre><br><p><br>and then run <em>(as <bold>root!</bold>)</em> with <em>tprof</em>,<br></p><p><br></p><pre>	tprof -v -s -p ./test_obj -x ./test_obj<br></pre><br><p><br>The <em>-p "filename"</em> specifies the name of the process to be profiled, <br>the <em>-x "executable [executable options]"</em> is the command line of<br>the program to be run, and the <em>-s</em> flag specifies that <em>tprof</em><br>is to profile shared libraries as well as the code statically linked into <br>the executable.<br><br>Running the program through <em>tprof</em> generates a number of log files <br>prefixed by a double leading underscore, "__", and a file containing the <br>profile of the run called <em>__test_obj.all</em>. The data in this file <br>looks something like<br></p><p><br></p><pre>	        Process      PID      TID  Total Kernel   User Shared  Other<br>	        =======      ===      ===  ===== ======   ==== ======  =====<br>	     ./test_obj    65784   146411  22382    916  10235  11231      0<br>	     ./test_obj    65784   104533  22328    229  10329  11770      0<br>	        =======      ===      ===  ===== ======   ==== ======  =====<br>	          Total                    44710   1145  20564  23001      0<br>	<br>	<br>	          Process   FREQ  Total Kernel   User Shared  Other<br>	          =======    ===  ===== ======   ==== ======  =====<br>	       ./test_obj      2  44710   1145  20564  23001      0<br>	          =======    ===  ===== ======   ==== ======  =====<br>	            Total      2  44710   1145  20564  23001      0<br>	<br>	<br>	 Total System Ticks: 716480 (used to calculate function level CPU)<br>	<br>	<br>	   Total Ticks For ./test_obj (USER) = 20564 <br>	<br>	 Subroutine                Ticks  %   Source   Address Bytes<br>	 =============             ===== ==== ======== ======== ======<br>	 .calc_residual_@OL@9       6386  0.9 matrix.f 375cc    19c   <br>	 .matrix_mult_@OL@A         6243  0.9 matrix.f 6ff54    190   <br>	 .solv_fix_bounds_@OL@2     2794  0.4 matrix.f 65b90    238   <br>	 .conjgrad_sub_@OL@1        1954  0.3 cg.F     68664    d8    <br>	 .pdaxpy_@OL@1               828  0.1 pdaxpy.f 6ec0c    a8    <br>	 .pdcopy_@OL@1               734  0.1 pdcopy.f 37a80    9c    <br>	 .pddot_@OL@1                455  0.1 pddot.f  6f74c    108   <br>	 .precon_solve_@OL@1         428  0.1 entry.F  690c0    78    <br>	 .pdnrm2_@OL@1               369  0.1 pdnrm2.f 688c8    12c   <br>	 .idamax_                    117  0.0 glink.s  6771c    28    <br>	 .pdload_@OL@1               116  0.0 pdload.f a580     74    <br>	 .dasum_                     112  0.0 glink.s  35450    28    <br>	 ._xlsmpGetSLock               5  0.0 glink.s  35400    28    <br>	 .pddot_                       4  0.0 pddot.f  6f3a4    3a8   <br>	 .conjgrad_sub_                4  0.0 cg.F     67bb4    ab0   <br>	 .pdnrm2_                      4  0.0 pdnrm2.f 6873c    18c   <br>	 .solv_fix_bounds_             2  0.0 matrix.f 657a8    2a0   <br>	 .calc_residual_               2  0.0 matrix.f 37380    24c   <br>	 .lsame_                       1  0.0 lsame.c  42910    60    <br>	 .dlamch_                      1  0.0 dlamch.c 672bc    36c   <br>	 ._xlfReadUfmt                 1  0.0 glink.s  b417c    28    <br>	 .iter_solve_                  1  0.0 entry.F  64e68    940   <br>	 .conjgrad_                    1  0.0 cg.F     67744    470   <br>	 ._xlsmpParallelDoSetup_TPO    1  0.0 glink.s  a690     28    <br>	 .dlamc4_                      1  0.0 dlamch.c 66164    308   <br>	<br>	   Total Ticks For ./test_obj (SH-LIBs) = 23001 <br>	<br>	 Shared Object                            Ticks  %   Address  Bytes <br>	 =============                            ===== ==== ======== ======<br>	 /usr/lib/libesslsmp.a/essl.o             17792  2.5 d92bf100 8cd61c<br>	 /usr/lib/libxlsmp.a/smprt.o               2931  0.4 d92ae100 10376 <br>	 /usr/lib/libpthreads.a/shr_xpg5.o         2245  0.3 d0047000 2b4d5 <br>	 /usr/lib/libxlf90.a/io.o                    23  0.0 d313a100 4c95a <br>	 /usr/lib/libc.a/shr.o                       10  0.0 d01cb640 1ef281<br>	<br>	<br>	   Profile: /usr/lib/libesslsmp.a/essl.o<br>	<br>	   Total Ticks For ./test_obj (/usr/lib/libesslsmp.a/essl.o) = 17792 <br>	<br>	 Subroutine                Ticks  %   Source   Address Bytes<br>	<br>	 Subroutine                Ticks  %   Source   Address Bytes<br>	 =============             ===== ==== ======== ======== ======<br>	 .dasum                     9113  1.3 dasum.f  94c5ea8  1bf8  <br>	 .idamax                    5757  0.8 idamax.f 94c4eb8  5e0   <br>	 .cpuid                     1283  0.2 cpuid.c  92c253c  180   <br>	 ._restf22                   805  0.1 noname   92c272c  4     <br>	 .__xlf_omp_in_parallel_i4   227  0.0 glink.s  92c26bc  28    <br>	 .__xlf_omp_get_nested_i4    119  0.0 glink.s  92c26e4  28    <br>	 .Srestf28                    84  0.0 noname   92c2744  4     <br>	 ._savef22                    66  0.0 noname   92c24f8  4     <br>	 .dasum_                      60  0.0 esslqext.s 99f39c8  4     <br>	 .idamax_                     59  0.0 esslqext.s 99f39a0  4     <br>	 ._savef26                    53  0.0 noname   92c2508  4     <br>	 .Ssavef28                    47  0.0 noname   92c2510  4     <br>	 ._savef30                    43  0.0 noname   92c2518  4     <br>	 .$RESTF30                    40  0.0 noname   92c274c  4     <br>	 .$RESTF26                    36  0.0 noname   92c273c  4     <br><br>... etc<br></pre><br><p><br><br>The above profile is from a program parallelised using the OpenMP Fortran<br>compiler. Note that a number of routines are represented by separate <br>serial and parallel routines. Thus the Fortran subroutine <em>CALC_RESIDUAL</em><br>has a time for the serial component <em>calc_residual_</em> and for the<br>parallel loops <em>calc_residual_@OL@9</em>.<br><br></p><p><br><br>There are several things that bare keeping in mind about the output from<br><em>tprof</em>;<br></p><p><br></p><ul><br><li><br>The process profiles <bold>all</bold> programs running on the system <br>that have the name specified by the <em>-p "filename"</em> option. <br>Thus it pays to name your program something that will be unique. (For<br>example, if profiling the cmiss backend <em>cm</em>, if the executable <br>being run is named <em>cm</em>, then <em>all</em> cmiss processes currently <br>running will be profiled. To prevent this, rename<br>the executable being tested to something like <em>cm_joe.bloggs</em>).<br></li><li><br>The samples are given as a percentage of all tests made on all processors.<br>Thus, on a 32 processor machine, the number of ticks encountered by a <br>single thread will be 1/32 of the total system made, and this is the<br>percentage shown in the % column in the profile output file.<br></li><li><br>You have to run this as <bold>root</bold>. This means you will have to<br>persuade the System Admin to run the jobs for you.<br></li><li><br>Only 32 bit executables can be profiled. Profiling 64 bit executables <br>results in an empty output file.<br></li><li><br>Any shared libraries must be world readable. Otherwise they do not get<br>profiled. This is rather strange considering we are running as root ...<br></li><li><br>When running on cmiss we normally get a problem with the buffer overflowing.<br>I am not sure how to get around this one.<br></li></ul><br><p><br><br></p><hr><br></pre>