Linux Performance Analysis
Perf analysis is super interesting to me - why does an application run faster or slower under certain conditions? Why does one compiler (or compiler switch) produce a faster application than another? I want to know what tricks my compiler is doing to speed up my app.
This post is an example performance analysis of an application called POV-Ray. I explain my benchmark choice in the section on POV-Ray.
Approaches
There are two ways I think about approaching performance analysis: a top-down approach and a bottom-up approach. I use perf for both of these approaches, so we’ll start with an overview of perf and then apply these approaches to povray.
Top-down approach: Look at the application starting at the root of the call stack. What does main() look like? What is the application doing at an extremely high level?
Bottom-up approach: Look at the fine-grained details of the application. What instructions are being executed? Is the application memory-, network-, or compute-bound? Where are these instructions coming from in the source?
The Linux Perf Tool
So how do we see into the guts of this app as it’s running?
IMO the best place to start (and often finish) is with the perf tool[1].
Perf is part of the Linux kernel project, so it’s supported on all Linux platforms.
If you don’t already have it, you can probably install it from your package manager as linux-tools-common:
sudo apt install linux-tools-common linux-tools-`uname -r`
Perf has lots of commands, but the main two you’ll need to interact with are perf-record and perf-report. The workflow is generally:
; perf stat -- ./a.out
# This leaves the recorded data in ./perf.data
; perf record -- ./a.out
; perf report
Perf report helps you drill into the call stack to see where samples were recorded in the application, even down to the assembly instructions that corresponded to samples.
Perf Events and Perf List
Note that in the previous section I said perf report helps you view where samples were recorded and not where time was spent; perf watches for events and takes periodic samples of what’s happening on the system when it wakes up. These samples do not necessarily indicate where user-time is being spent.
Depending on your system, kernel configuration, and the configuration of perf itself, you’ll have different events available to profile.
Run perf list[2] to get a view of all the sampling events you can use on your system:
; perf list
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
...
The list of samplable events is rather long and often has architecture- and CPU-specific entries, so I’ll leave it as an exercise for the reader to see what perf events are available on your system, and learn what they all mean.
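If you just want to narrow that list down, perf list also accepts a category or a glob as an argument:
; perf list hw        # only hardware events
; perf list cache     # only cache-related events
; perf list 'branch*' # glob match against event names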
The -F flag tells perf what observation frequency it should use when recording samples - often -F 99 (for 99 Hz) is a good place to start; you get enough data to gain insights without being overwhelmed. You can always turn it down for longer-running applications or when you’re sampling many different events.
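As a concrete example, sampling cycles at 99 Hz looks like this (./a.out standing in for your application, as elsewhere in this post):
; perf record -F 99 -e cycles -- ./a.out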
Perf Stat
The best place to start with perf is often perf stat. This command gives a brief overview of total event counts for the run.
If something from perf stat’s report stands out, you can use perf record with that event to drill into the sources of those samples.
A perf stat run might look like this:
; perf stat -- ./a.out
Performance counter stats for './a.out':
21,829.89 msec task-clock # 0.963 CPUs utilized
7,097 context-switches # 325.105 /sec
1 cpu-migrations # 0.046 /sec
5,062 page-faults # 231.884 /sec
70,001,621,188 cycles # 3.207 GHz
155,086,020,805 instructions # 2.22 insn per cycle
9,013,464,722 branches # 412.896 M/sec
49,795,347 branch-misses # 0.55% of all branches
22.661088635 seconds time elapsed
21.785643000 seconds user
0.051956000 seconds sys
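For instance, if the branch-misses line above caught your eye, a follow-up run could count just the branch events; the --repeat flag averages the counts over several runs to smooth out noise:
; perf stat -e branches,branch-misses --repeat 5 -- ./a.out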
Perf Record
perf record is the primary command for recording samples about your application or system. My perf record commands usually look like this:
; export \
    APP=a.out \
    FREQ=99 \
    EVENTS="cycles,instructions,branches,L1-dcache-loads,task-clock"
; perf record \
    --output perf-$APP.data \
    --call-graph fp \
    -F $FREQ -e $EVENTS \
    -- taskset 0x2 ./$APP >/dev/null
I’m using --call-graph fp because I want perf to record callgraph information using the frame pointer - this is why you must often build your application with the -fno-omit-frame-pointer compiler flag (more on that later).
I’m also using taskset 0x2 because I only want the app to run on a single core in this example; perf can also record data for everything running on your entire system if you would like it to - or just on a specific core, or for a specific application.
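For reference, the scope of a recording is controlled by a few flags; a quick sketch (the sleep 10 just bounds how long perf records, and a.out stands in for your process):
# everything running on the whole system, for 10 seconds
; perf record -a -- sleep 10
# only what runs on core 2, for 10 seconds
; perf record -C 2 -- sleep 10
# a specific already-running process, found by pid
; perf record -p $(pidof a.out) -- sleep 10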
Perf Report
perf report will give you a TUI report like this by default:
Samples: 88K of event 'cycles', Event count (approx.): 72137516526
Children Self Command Shared Object Symbol
+ 99.61% 0.00% povray libboost_thread.so.1.74.0 [.] 0x00007f61e2d6f0cb
+ 99.54% 0.00% povray povray [.] pov::Task::TaskThread
+ 97.41% 0.03% povray povray [.] pov::Trace::ComputeTextureColour
+ 97.40% 0.06% povray povray [.] pov::Trace::ComputeOneTextureColour
...
Notice the event used for the report is given in the first line.
perf report --stdio gives the same information initially, but with all the call stacks expanded; this may get overwhelming.
For the ~20-second recording I took for this example, the stdio output of perf report was over 10k lines long:
; perf report --stdio|wc -l
10010
From inside the TUI you can press h to get a list of all the available commands, so I won’t enumerate them here.
I usually run perf report with the -G flag, which is shorthand for --inverted, meaning the callgraph representation is inverted.
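Since my record command above wrote to a non-default output file, the matching report needs the -i (--input) flag to find it; combined with -G, that looks like:
; perf report -G -i perf-a.out.data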
You may have noticed that the snippet from perf report I pasted above starts with two columns: Self and Children.
The Children column indicates the percentage of samples taken in that stack frame or any of its children - that is, samples recorded while in this stack frame or in any function called from it.
The Self column is more significant: it indicates what percentage of samples were taken in the given stack frame only - meaning instructions coming from that function alone, and not from any functions it calls.
The main() function is always at the top, since it calls all other functions. However, unless your entire program was inlined into the main routine, its Self column is likely very low, since most of the work being done is probably happening elsewhere.
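If the Self numbers are what you care about, perf report can also drop the Children column entirely and sort by self overhead:
; perf report --no-children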
FlameGraph
I mention Brendan Gregg[3] a few times in this post, and you should get familiar with him and his work. His blog has many pearls, and he might have a one-liner for exactly your use case.
One of his other contributions is the FlameGraph repo[4].
Remember how our perf report contained over 10k lines of reporting for just a single application running for ~20 seconds? His FlameGraph repo gives us a way to visualize and gain insights from all of that data at a very high level by creating a flamegraph from perf’s recorded data.
The FlameGraph repo actually knows how to deal with other profilers too, like DTrace and SystemTap.
A workflow for generating a flamegraph might look like this:
# build and profile your application
; make
; perf record --call-graph fp -- ./a.out
; git clone https://github.com/brendangregg/FlameGraph ../FlameGraph
; perf script \
| ../FlameGraph/stackcollapse-perf.pl \
| ../FlameGraph/flamegraph.pl \
> flamegraph.svg
The FlameGraph scripts have actually been merged into the Linux kernel’s repo, so perf built for a newer kernel has FlameGraph as a built-in script, used like so:
; perf script flamegraph -- ./a.out
# alternatively...
; perf record -- ./a.out
; perf script report flamegraph
This requires python scripting support built into perf, which my perf build does not have, so I can’t test it myself. I still use the scripts from Brendan’s repo.
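If you want to check your own build, newer versions of perf can print the features they were compiled with; something like this should tell you whether Python scripting is available:
# look for the python feature in perf's build options
; perf version --build-options | grep -i python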
POV-Ray
POV-Ray[5] is a 3D graphics code commonly used for benchmarking - it’s part of CPU benchmarking suites from OpenBenchmarking[6] and SPEC 2017[7], which means a few things:
- It’s reasonably well-optimized. Compiler writers and hardware vendors don’t care too much about benchmarking silly code that doesn’t represent what users will actually be running.
- It’s cross-platform. Part of its utility is that we can compare performance across hardware vendors.
- It’s well-supported by most/all compilers. Compiler authors and hardware vendors care about how well POV-Ray runs on their tech, so we can assume they’ve put effort into handling POV-Ray’s code well and ensuring it builds with their compilers.
- It doesn’t rely too much on libraries. OpenBenchmarking and SPEC suites are especially useful for benchmarking because they are mostly self-contained.
Building POV-Ray
POV-Ray is open source, so we can download it and build it ourselves:
; git clone --branch latest-stable git@github.com:POV-Ray/povray.git
; cd povray
; (cd unix; ./preinstall.sh)
We will build the app with some debug information enabled so we have more visibility into the app’s behavior as it runs:
; ./configure \
--disable-strip \
--prefix=$PWD/../povray-gcc-12/ \
COMPILED_BY="Asher Mancinelli on $(date)" \
CFLAGS='-fno-omit-frame-pointer' CXXFLAGS='-fno-omit-frame-pointer' \
CC=gcc-12 CXX=g++-12
; ./unix/povray --version |& grep flags
Compiler flags: -pipe -Wno-multichar -Wno-write-strings -fno-enforce-eh-specs -Wno-non-template-friend -g -pg -O3 -ffast-math -march=native -fno-omit-frame-pointer
You’ll notice I used the unfortunately-named -fno-omit-frame-pointer. This tells the compiler to maintain the frame pointer in the frame pointer register (rbp on x86_64 systems); the compiler might otherwise reuse that register as a general-purpose register, but we’re going to tell the perf tool to use the frame pointer register for building analyses, so we need to keep it around.
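As an aside: if rebuilding with -fno-omit-frame-pointer isn’t an option, perf can instead unwind call stacks from DWARF debug info, copying a slice of the stack with each sample - heavier on disk and CPU than frame pointers, but it works without them:
; perf record --call-graph dwarf -- ./a.out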
Once we have the app built, we can run the standard benchmark (this takes a while):
; make -j `nproc` install
; ./unix/povray --benchmark </dev/null
...
Render Options
Quality: 9
Bounding boxes.......On Bounding threshold: 3
Antialiasing.........On (Method 1, Threshold 0.300, Depth 3, Jitter 0.30,
Gamma 2.50)
==== [Rendering...] ========================================================
Rendered 15360 of 262144 pixels (5%)
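With the instrumented build in hand, we can combine the pieces from earlier and record the benchmark itself (perf-povray.data is just the output name I’m picking here):
; perf record --call-graph fp -F 99 \
    --output perf-povray.data \
    -- ./unix/povray --benchmark </dev/null
; perf report -G -i perf-povray.data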
Further Reading
Truly, read the manpages. The perf man pages could be more thorough, and some commands are not exceptionally well-documented (looking at you, perf diff), but they are invaluable resources.
Search for Brendan Gregg on YouTube; he has plenty of great talks there. For example: Give me 15 minutes and I’ll change your view of Linux tracing.