Profiling with perf and friends

This is just a short tutorial on using perf and friends to figure out where to start optimizing. Our example is dmd compiling the release build of libphobos2.so.

First of all, let's figure out the command we're interested in.

cd phobos
make -f posix.mak | grep -F libphobos2.so
../dmd/src/dmd -conf= -I../druntime/import  -w -dip25 -m64 -fPIC -O -release -shared -debuglib= -defaultlib= -ofgenerated/linux/release/64/libphobos2.so.0.73.0 -L-soname=libphobos2.so.0.73 ../druntime/generated/linux/release/64/libdruntime.so.a -L-ldl std/array.d std/ascii.d std/base64.d std/bigint.d std/bitmanip.d ...
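
If the library is already built, the same command can be obtained without recompiling anything by asking GNU make for a dry run (-n prints the recipe commands instead of executing them, -B pretends all targets are out of date). This is just a convenience; the grep pipeline above works the same way.

make -f posix.mak -n -B | grep -F libphobos2.so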

perf-stat

A very good starting point for a high-level overview is perf stat, which collects CPU event counts.

perf stat -r 5 -- ../dmd/src/dmd -conf= -I../druntime/import  -w ...
       2932.072376      task-clock (msec)         #    0.968 CPUs utilized            ( +-  0.34% )
                13      context-switches          #    0.004 K/sec                    ( +-  2.92% )
                 3      cpu-migrations            #    0.001 K/sec
           230,120      page-faults               #    0.078 M/sec                    ( +-  0.00% )
    10,942,586,352      cycles                    #    3.732 GHz                      ( +-  0.34% )  (34.19%)
    14,322,043,503      instructions              #    1.31  insn per cycle           ( +-  0.06% )  (50.00%)
     3,009,171,058      branches                  # 1026.295 M/sec                    ( +-  0.30% )  (32.70%)
        78,587,057      branch-misses             #    2.61% of all branches          ( +-  0.24% )  (30.76%)

       3.029178061 seconds time elapsed

It already highlights (colors) numbers that are extremely off, so obvious outliers are easy to spot.
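
To follow up on a suspicious counter, perf stat can also be restricted to a hand-picked set of events with -e. A minimal sketch, assuming the generic symbolic event names below are available on your CPU (perf list shows what is):

perf stat -e cycles,instructions,branches,branch-misses -r 5 -- ../dmd/src/dmd -conf= -I../druntime/import  -w ...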

toplev

toplev (part of Andi Kleen's pmu-tools) is another great tool for a more detailed, yet still easily understandable, high-level overview.

./toplev.py --level 2 taskset -c 0 -- ../dmd/src/dmd -conf= -I../druntime/import  -w ...
C0    FE      Frontend_Bound:                             34.75 %           [  2.92%]
    This category represents slots fraction where the
    processor's Frontend undersupplies its Backend...
    Sampling events:  frontend_retired.latency_ge_8:pp
C0    FE      Frontend_Bound.Frontend_Latency:            24.38 %           [  2.92%]
    This metric represents slots fraction the CPU was stalled
    due to Frontend latency issues...
    Sampling events:  frontend_retired.latency_ge_16:pp frontend_retired.latency_ge_32:pp
C0    BAD     Bad_Speculation:                            14.05 %           [  2.92%]
C0    BAD     Bad_Speculation.Branch_Mispredicts:         13.65 %           [  2.92%]
    This metric represents slots fraction the CPU has wasted due
    to Branch Misprediction...
    Sampling events:  br_misp_retired.all_branches
C0-T0         MUX:                                         2.92 %
    PerfMon Event Multiplexing accuracy indicator
C1    FE      Frontend_Bound:                             42.02 %           [  2.92%]
C1    FE      Frontend_Bound.Frontend_Latency:            31.16 %           [  2.92%]
C1-T0         MUX:                                         2.92 %
C2    FE      Frontend_Bound:                             40.71 %           [  2.92%]
C2    FE      Frontend_Bound.Frontend_Latency:            34.68 %           [  2.92%]
C2    BAD     Bad_Speculation:                            10.23 %           [  2.92%]
C2    BAD     Bad_Speculation.Branch_Mispredicts:          9.66 %           [  2.92%]
C2    BE      Backend_Bound:                              35.74 %           [  2.92%]
C2    BE/Mem  Backend_Bound.Memory_Bound:                 21.60 %           [  2.92%]
    This metric represents slots fraction the Memory subsystem
    within the Backend was a bottleneck...
C2    RET     Retiring:                                   13.77 %           [  2.92%]
C2    RET     Retiring.Microcode_Sequencer:                8.49 %           [  5.84%]
    This metric represents slots fraction the CPU was retiring
    uops fetched by the Microcode Sequencer (MS) unit...
    Sampling events:  idq.ms_uops
C2-T0         MUX:                                         2.92 %
C3    FE      Frontend_Bound:                             36.71 %           [  2.92%]
C3    FE      Frontend_Bound.Frontend_Latency:            45.72 %           [  2.93%]
C3    BAD     Bad_Speculation:                            11.72 %           [  2.92%]
C3    BAD     Bad_Speculation.Branch_Mispredicts:         11.28 %           [  2.91%]
C3    BE      Backend_Bound:                              37.37 %           [  2.92%]
C3    BE/Mem  Backend_Bound.Memory_Bound:                 23.81 %           [  2.92%]
C3    RET     Retiring:                                   13.83 %           [  2.92%]
C3    RET     Retiring.Microcode_Sequencer:                8.74 %           [  5.84%]
C3-T0         MUX:                                         2.91 %
C0-T1         MUX:                                         2.92 %
C1-T1         MUX:                                         2.92 %
C2-T1         MUX:                                         2.92 %
C3-T1         MUX:                                         2.91 %

The level of detail can be selected using --level X (with X from 1 to 5; see also Selecting the right level and multiplexing), and toplev can also record events over time and plot them.

./toplev.py --level 3 -I 10 -x, -o x.csv taskset -c 0 -- ../dmd/src/dmd -conf= -I../druntime/import  -w ...
./tl-barplot.py x.csv --cpu C0-T0 -o toplev_dmd_barplot.png

[Figure: toplev_dmd_barplot.png — toplev breakdown for C0-T0 plotted over time]

perf-record

perf record is the workhorse for drilling down into performance problems, down to the instruction level. The basic workflow is to record events and then analyze them with the interactive perf report.

perf record -- ../dmd/src/dmd -conf= -I../druntime/import  -w ...
perf report

Another interesting mode is recording call-graphs.

perf record -g -- ../dmd/src/dmd -conf= -I../druntime/import  -w ...
perf report

It’s useful to play around with the --freq= option to collect more samples, and with the --event= option to record events other than the default cycles, e.g. branch-misses. Ask perf list for all available events. While neither of perf record’s call-graph collection methods, frame pointers or DWARF backtraces, works for all of dmd, using frame pointers (perf record -g or perf record --call-graph fp instead of perf record --call-graph dwarf) captures most of it.
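
As a concrete sketch of combining those options (the 4000 Hz sampling frequency is an arbitrary choice, not a value from the original run), sampling branch misses with frame-pointer call-graphs could look like this:

perf record --freq=4000 --event=branch-misses --call-graph fp -- ../dmd/src/dmd -conf= -I../druntime/import  -w ...
perf report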

FlameGraph

The latest addition to my optimization toolbox is CPU Flame Graphs, a collection of scripts to visualize profiles with call-graphs. After converting the profiler-specific stack traces (stackcollapse-perf.pl for perf), flamegraph.pl generates an interactive SVG file. We limit the stack depth so as not to kill the browser or SVG viewer.

perf script --max-stack 24 | ./stackcollapse-perf.pl > profile.folded
# less profile.folded
./flamegraph.pl profile.folded > profile.svg
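
Because the folded stacks are plain text, one call chain per line, they can also be filtered before plotting to zoom in on a subtree. A sketch, where the Parser pattern is purely hypothetical; substitute whatever symbol shows up in your own profile:

grep Parser profile.folded | ./flamegraph.pl > parser.svg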

[Figure: profile.svg — flame graph of the dmd call stacks]
