Just a short tutorial on using perf and friends to figure out where to start with optimizations. Our example will be dmd compiling the release build of libphobos2.so.
First of all, we need to figure out the command we’re interested in.
cd phobos
make -f posix.mak | grep -F libphobos2.so
../dmd/src/dmd -conf= -I../druntime/import -w -dip25 -m64 -fPIC -O -release -shared -debuglib= -defaultlib= -ofgenerated/linux/release/64/libphobos2.so.0.73.0 -L-soname=libphobos2.so.0.73 ../druntime/generated/linux/release/64/libdruntime.so.a -L-ldl std/array.d std/ascii.d std/base64.d std/bigint.d std/bitmanip.d ...
perf-stat
A very good way to start with a high-level overview is perf stat, which reports CPU event counts.
perf stat -r 5 -- ../dmd/src/dmd -conf= -I../druntime/import -w ...
2932.072376 task-clock (msec) # 0.968 CPUs utilized ( +- 0.34% )
13 context-switches # 0.004 K/sec ( +- 2.92% )
3 cpu-migrations # 0.001 K/sec
230,120 page-faults # 0.078 M/sec ( +- 0.00% )
10,942,586,352 cycles # 3.732 GHz ( +- 0.34% ) (34.19%)
14,322,043,503 instructions # 1.31 insn per cycle ( +- 0.06% ) (50.00%)
3,009,171,058 branches # 1026.295 M/sec ( +- 0.30% ) (32.70%)
78,587,057 branch-misses # 2.61% of all branches ( +- 0.24% ) (30.76%)
3.029178061 seconds time elapsed
perf stat will already color numbers that are far outside the usual range.
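The derived figures in the right-hand column follow directly from the raw counts. As a quick sanity check, here is a small shell sketch that recomputes two of them from the numbers printed above; the variable names are only illustrative.

```shell
# Raw counts copied from the perf stat output above.
cycles=10942586352
instructions=14322043503
branches=3009171058
branch_misses=78587057

# Instructions per cycle and branch-miss rate, rounded to two decimals.
ipc=$(awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "%.2f", i / c }')
miss_pct=$(awk -v m="$branch_misses" -v b="$branches" 'BEGIN { printf "%.2f", 100 * m / b }')

echo "insn per cycle:   $ipc"
echo "branch-miss rate: $miss_pct%"
```

This reproduces the 1.31 insn per cycle and the 2.61% of all branches that perf stat reports.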
toplev
toplev is another great tool that gives a more detailed and easier to understand high-level overview.
./toplev.py --level 2 taskset -c 0 -- ../dmd/src/dmd -conf= -I../druntime/import -w ...
C0 FE Frontend_Bound: 34.75 % [ 2.92%]
This category represents slots fraction where the
processor's Frontend undersupplies its Backend...
Sampling events: frontend_retired.latency_ge_8:pp
C0 FE Frontend_Bound.Frontend_Latency: 24.38 % [ 2.92%]
This metric represents slots fraction the CPU was stalled
due to Frontend latency issues...
Sampling events: frontend_retired.latency_ge_16:pp frontend_retired.latency_ge_32:pp
C0 BAD Bad_Speculation: 14.05 % [ 2.92%]
C0 BAD Bad_Speculation.Branch_Mispredicts: 13.65 % [ 2.92%]
This metric represents slots fraction the CPU has wasted due
to Branch Misprediction...
Sampling events: br_misp_retired.all_branches
C0-T0 MUX: 2.92 %
PerfMon Event Multiplexing accuracy indicator
C1 FE Frontend_Bound: 42.02 % [ 2.92%]
C1 FE Frontend_Bound.Frontend_Latency: 31.16 % [ 2.92%]
C1-T0 MUX: 2.92 %
C2 FE Frontend_Bound: 40.71 % [ 2.92%]
C2 FE Frontend_Bound.Frontend_Latency: 34.68 % [ 2.92%]
C2 BAD Bad_Speculation: 10.23 % [ 2.92%]
C2 BAD Bad_Speculation.Branch_Mispredicts: 9.66 % [ 2.92%]
C2 BE Backend_Bound: 35.74 % [ 2.92%]
C2 BE/Mem Backend_Bound.Memory_Bound: 21.60 % [ 2.92%]
This metric represents slots fraction the Memory subsystem
within the Backend was a bottleneck...
C2 RET Retiring: 13.77 % [ 2.92%]
C2 RET Retiring.Microcode_Sequencer: 8.49 % [ 5.84%]
This metric represents slots fraction the CPU was retiring
uops fetched by the Microcode Sequencer (MS) unit...
Sampling events: idq.ms_uops
C2-T0 MUX: 2.92 %
C3 FE Frontend_Bound: 36.71 % [ 2.92%]
C3 FE Frontend_Bound.Frontend_Latency: 45.72 % [ 2.93%]
C3 BAD Bad_Speculation: 11.72 % [ 2.92%]
C3 BAD Bad_Speculation.Branch_Mispredicts: 11.28 % [ 2.91%]
C3 BE Backend_Bound: 37.37 % [ 2.92%]
C3 BE/Mem Backend_Bound.Memory_Bound: 23.81 % [ 2.92%]
C3 RET Retiring: 13.83 % [ 2.92%]
C3 RET Retiring.Microcode_Sequencer: 8.74 % [ 5.84%]
C3-T0 MUX: 2.91 %
C0-T1 MUX: 2.92 %
C1-T1 MUX: 2.92 %
C2-T1 MUX: 2.92 %
C3-T1 MUX: 2.91 %
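One way to read this output: in the top-down method that toplev implements, the four level 1 categories (Frontend_Bound, Bad_Speculation, Backend_Bound, Retiring) split up the available pipeline slots, so on a core where all four are printed they should add up to roughly 100%. A small shell sketch that checks this for C2, with the values copied from above and illustrative variable names:

```shell
# Level 1 values for core C2, copied from the toplev output above.
frontend_bound=40.71
bad_speculation=10.23
backend_bound=35.74
retiring=13.77

total=$(awk -v fe="$frontend_bound" -v bad="$bad_speculation" \
            -v be="$backend_bound" -v ret="$retiring" \
            'BEGIN { printf "%.2f", fe + bad + be + ret }')
echo "C2 level 1 total: $total %"
```

The total comes out at 100.45 rather than exactly 100, which seems plausible given that the counters are multiplexed (the MUX lines). For C0 and C1 only some of the categories are listed, so the same check does not apply there.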
The level of detail can be selected using --level X (with X from 1-5); also see "Selecting the right level and multiplexing". toplev can also record and plot events over time.
./toplev.py --level 3 -I 10 -x, -o x.csv taskset -c 0 -- ../dmd/src/dmd -conf= -I../druntime/import -w ...
./tl-barplot.py x.csv --cpu C0-T0 -o toplev_dmd_barplot.png
perf-record
perf record is the workhorse for drilling down into performance problems, all the way to the instruction level.
The basic workflow is to record events and then analyze them with the interactive perf report.
perf record -- ../dmd/src/dmd -conf= -I../druntime/import -w ...
perf report
Another interesting mode is recording call-graphs.
perf record -g -- ../dmd/src/dmd -conf= -I../druntime/import -w ...
perf report
It’s useful to play around with the --freq= option to collect more samples, and the --event= option to gather events other than the default cycles, e.g. branch-misses. Ask perf list for all available events.
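As a sketch of how these options combine, the following shell function records a single event at a chosen sampling frequency and writes each profile to its own file. The function name record_event, the PERF override variable, and the perf.EVENT.data naming scheme are made up for this example; --event, --freq and --output are regular perf record options.

```shell
# Hypothetical helper, not part of perf.
# Usage: record_event EVENT FREQ COMMAND [ARGS...]
record_event() {
    event=$1
    freq=$2
    shift 2
    # PERF may point at a different perf binary (an assumption of this
    # sketch); by default the perf found in PATH is used.
    "${PERF:-perf}" record --event="$event" --freq="$freq" \
        --output="perf.$event.data" -- "$@"
}
```

With this, record_event branch-misses 10000 ../dmd/src/dmd -conf= -I../druntime/import -w ... would sample branch misses at roughly 10 kHz, and perf report --input=perf.branch-misses.data would open the result.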
While neither of perf record’s call-graph collection methods, frame pointers or DWARF backtraces, works for all of dmd, using frame pointers (perf record -g or perf record --call-graph fp instead of perf record --call-graph dwarf) captures most of it.
FlameGraph
The latest addition to my optimization toolbox is CPU Flame Graphs, a bunch of scripts to visualize profiles with call-graphs.
After the profiler-specific stack traces have been converted (with stackcollapse-perf.pl for perf), flamegraph.pl generates an interactive SVG file.
We limit the stack depth so as not to overwhelm the browser or SVG viewer.
perf script --max-stack 24 | ./stackcollapse-perf.pl > profile.folded
# less profile.folded
./flamegraph.pl profile.folded > profile.svg
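The folded file is also handy on its own. Each line holds one stack, with frames separated by semicolons from root to leaf, followed by a space and the sample count. Assuming that format, the sketch below sums the samples per leaf frame, i.e. the samples attributed to a function itself rather than to its callees. The function names folded_sample and self_samples and the frames in the sample data are invented for illustration; they are not taken from a real dmd profile.

```shell
# Invented sample data in the folded format: root;...;leaf count
folded_sample() {
    cat <<'EOF'
dmd;main;parse 10
dmd;main;semantic;templateInstance 25
dmd;main;semantic 5
dmd;main;codegen;optimize 30
dmd;main;semantic;templateInstance 15
EOF
}

# Sum sample counts per leaf frame and list the heaviest first.
# Reads folded stacks from the given files, or from stdin if none are given.
self_samples() {
    awk '{
        n = $NF                      # the sample count is the last field
        sub(/ [0-9]+$/, "")          # drop the count, keeping only the stack
        k = split($0, frames, ";")   # frames[k] is the leaf frame
        self[frames[k]] += n
    }
    END { for (f in self) printf "%d %s\n", self[f], f }' "$@" | sort -rn
}

folded_sample | self_samples
```

On real data the same function can be pointed at the file from above, e.g. self_samples profile.folded | head.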