Linux Performance Analysis

Perf analysis is super interesting to me - why does an application run faster or slower under certain conditions? Why does one compiler (or compiler switch) produce a faster application than any other? I want to know what tricks my compiler is doing to speed up my app.

This post is an example performance analysis of an application called POV-Ray. I explain my benchmark choice in the section on POV-Ray.

Approaches

There are two ways I think about approaching performance analysis: top-down and bottom-up. I use perf for both, so we’ll start with an overview of perf and then apply each approach to POV-Ray.

Key Terms

Top-down approach: Look at the application starting at the root of the call stack. What does main() look like? What is the application doing at an extremely high level?

Bottom-up approach: Look at the fine-grained details of the application. What instructions are being executed? Is the application memory-, network-, or compute-bound? Where are these instructions coming from in the source?

The Linux Perf Tool

So how do we see into the guts of this app as it’s running?

IMO the best place to start (and often finish) is with the perf tool1. Perf is part of the Linux kernel project, so it’s supported on essentially every Linux platform.

If you don’t already have it, you can probably install it from your package manager as linux-tools-common:

sudo apt install linux-tools-common linux-tools-`uname -r`

Perf has lots of commands, but the main ones you’ll need to interact with are perf stat, perf record, and perf report. The workflow is generally:

; perf stat -- ./a.out

# This leaves the recorded data in ./perf.data
; perf record -- ./a.out
; perf report

Perf report helps you drill into the call stack to see where samples were recorded in the application, even down to the assembly instructions that corresponded to samples.
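
When a function stands out in the report, perf annotate (reading the same perf.data) jumps straight to its disassembly with sample percentages on each instruction. The symbol name here is a placeholder - use one from your own report:

# Annotated disassembly for a single function.
; perf annotate my_hot_function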

Perf Events and Perf List

Note that in the previous section I said perf report helps you view where samples were recorded and not where time was spent; perf counts events and periodically takes a sample of what’s happening on the system when a counter fires. These samples do not necessarily indicate where user-time is being spent.

Depending on your system, kernel configuration, and the configuration of perf itself, you’ll have different events available to profile.

Run perf list2 to get a view of all the sampling events you can use on your system:

; perf list
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
...

The list of samplable events is rather long and often has architecture- and CPU-specific entries, so I’ll leave it as an exercise for the reader to see what perf events are available on your system, and learn what they all mean.
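
If you want to narrow the list down, perf list accepts a category or a glob:

# Only hardware events:
; perf list hw

# Only events whose names match a glob:
; perf list 'cache-*'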

The -F flag tells perf what observation frequency it should use when recording samples - often -F 99 (99 hertz, deliberately just off 100 so samples don’t land in lockstep with other periodic activity on the system) is a good place to start; you get enough data to gain insights without being overwhelmed. You can always turn it down for longer-running applications or when you’re sampling many different events.
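
Putting the pieces together, a minimal sampling run at 99 hertz looks like this:

; perf record -F 99 -e cycles -- ./a.out
; perf report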

Perf Stat

The best place to start with perf is often perf stat. Rather than sampling, this command counts events over the whole run and prints a summary. If something in perf stat’s report stands out, you can use perf record with that event to drill into the sources of those counts.

A perf stat run might look like this:

; perf stat -- ./a.out
 Performance counter stats for './a.out':

         21,829.89 msec task-clock                #    0.963 CPUs utilized          
             7,097      context-switches          #  325.105 /sec                   
                 1      cpu-migrations            #    0.046 /sec                   
             5,062      page-faults               #  231.884 /sec                   
    70,001,621,188      cycles                    #    3.207 GHz                    
   155,086,020,805      instructions              #    2.22  insn per cycle         
     9,013,464,722      branches                  #  412.896 M/sec                  
        49,795,347      branch-misses             #    0.55% of all branches        

      22.661088635 seconds time elapsed

      21.785643000 seconds user
       0.051956000 seconds sys
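
If one of those counters stands out - say, branch-misses - you can re-run perf stat counting just the events you care about before reaching for perf record:

# Count only branch behavior for a focused comparison.
; perf stat -e branches,branch-misses -- ./a.out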

Perf Record

perf record is the primary command for recording samples about your application or system.

My perf record commands usually look like this:

; export \
    APP=./a.out \
    FREQ=99 \
    EVENTS="cycles,instructions,branches,L1-dcache-loads,task-clock"
; perf record \
    --output perf-$(basename $APP).data \
    --call-graph fp \
    -F $FREQ -e $EVENTS \
    -- taskset 0x2 $APP >/dev/null

I’m using --call-graph fp because I want perf to record callgraph information using the frame pointer - this is why you must often build your application with the -fno-omit-frame-pointer compiler flag (more on that later).
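
For code you build yourself, that just means adding the flag next to your usual optimization flags; a minimal sketch (main.cpp stands in for your own sources):

# Keep the frame pointer so perf's fp-based unwinding works;
# -g adds debug info so samples map back to source lines.
; g++ -O2 -g -fno-omit-frame-pointer -o a.out main.cpp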

I’m also using taskset 0x2 because I only want the app to run on a single core in this example; perf can also record data for everything running on your entire system if you would like it to - or just on a specific core or for a specific application.
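
For reference, the whole-system and single-core variants look like this:

# Record everything on the system for 10 seconds:
; perf record -a -- sleep 10

# Record only what runs on CPU 1 (the core taskset 0x2 pins us to):
; perf record -C 1 -- sleep 10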

Perf Report

perf report will give you a TUI report like this by default:

Samples: 88K of event 'cycles', Event count (approx.): 72137516526
  Children      Self  Command  Shared Object              Symbol
+   99.61%     0.00%  povray   libboost_thread.so.1.74.0  [.] 0x00007f61e2d6f0cb
+   99.54%     0.00%  povray   povray                     [.] pov::Task::TaskThread
+   97.41%     0.03%  povray   povray                     [.] pov::Trace::ComputeTextureColour
+   97.40%     0.06%  povray   povray                     [.] pov::Trace::ComputeOneTextureColour
...

Notice the event used for the report is given in the first line.

perf report --stdio gives the same information initially, but with all the call stacks expanded; this may get overwhelming. For the 20-second recording I took for this example, the stdio output of perf report was over 10k lines long:

; perf report --stdio|wc -l
10010

From inside the TUI you can press h to get a list of all the available commands, so I won’t enumerate them here.

I usually run perf report with the -G flag, shorthand for --inverted: it inverts the callgraph, so you start from the functions where samples actually landed and walk up through their callers rather than down from main() (see the example at the end of this section).

You may have noticed that the snippet from perf report I pasted above starts with two columns: Children and Self.

The Self and Children columns

The Children column indicates the percentage of samples taken in that stack frame or any of its children - that is, samples recorded while in this stack frame or in any function called from it.

The Self column is more significant: it indicates what percentage of samples were taken in the given stack frame only - meaning instructions coming from that function alone, and not any functions it calls.

The main() function is always at the top, since it calls all other functions. However, unless your entire program was inlined into the main routine, its Self column is likely very low, since most of the work is probably happening elsewhere.
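
If you’d rather sort straight by where samples landed, you can drop the Children column entirely - and this is where -G comes in handy:

# Show and sort by Self only:
; perf report --no-children

# Invert the callgraph so the hottest leaf functions come first:
; perf report -G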

FlameGraph

I mention Brendan Gregg3 a few times in this post, and you should get familiar with him and his work. His blog has many pearls and he might have a one-liner for exactly your use case.

One of his other contributions is the FlameGraph repo4.

Remember how our perf report contains over 10k lines of reporting for just a single application running for ~20 seconds? His flamegraph repo gives us a way to visualize and gain insights from all of that data at a very high level by creating a flamegraph from perf’s recorded data.

Note

The FlameGraph repo actually knows how to deal with other profilers too, like DTrace and SystemTap.

A workflow for generating a flamegraph might look like this:

# build and profile your application
; make
; perf record --call-graph fp -- ./a.out

; git clone https://github.com/brendangregg/FlameGraph ../FlameGraph

; perf script \
    | ../FlameGraph/stackcollapse-perf.pl \
    | ../FlameGraph/flamegraph.pl \
    > flamegraph.svg

Note

The FlameGraph scripts have actually been merged into the Linux kernel’s repo, so perf built for a newer kernel has FlameGraph as a built-in script, used like so:

; perf script flamegraph -- ./a.out

# alternatively...
; perf record -- ./a.out
; perf script report flamegraph

This requires python scripting support built into perf, which my perf build does not have, so I can’t test it myself. I still use the scripts from Brendan’s repo.
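
If you aren’t sure whether your perf was built with python support, perf can tell you:

# Look for the libpython entry in the feature list.
; perf version --build-options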

POV-Ray

Povray5 is a 3D ray tracer commonly used for benchmarking - it’s part of CPU benchmark suites from OpenBenchmarking6 and SPEC CPU 20177, which means a few things:

  1. It’s reasonably well-optimized.

    Compiler writers and hardware vendors don’t care too much about benchmarking silly code that doesn’t represent what users will actually be running.

  2. It’s cross-platform.

    Part of its utility is that we can compare performance across hardware vendors.

  3. It’s well-supported by most/all compilers.

    Compiler authors and hardware vendors care about how well POV-Ray runs on their tech, so we can assume they’ve put effort into handling povray’s code well and ensuring it builds with their compilers.

  4. It doesn’t rely too much on libraries.

    OpenBenchmarking and SPEC suites are especially useful for benchmarking because they are mostly self-contained.

Building POV-Ray

POV-Ray is open source, so we can download it and build it ourselves:

; git clone --branch latest-stable git@github.com:POV-Ray/povray.git
; cd povray
; (cd unix; ./prebuild.sh)

We will build the app with some debug information enabled so we have more visibility into the app’s behavior as it runs:

; ./configure \
    --disable-strip \
    --prefix=$PWD/../povray-gcc-12/ \
    COMPILED_BY="Asher Mancinelli on $(date)" \
    CFLAGS='-fno-omit-frame-pointer' CXXFLAGS='-fno-omit-frame-pointer' \
    CC=gcc-12 CXX=g++-12
; ./unix/povray --version |& grep flags
  Compiler flags:      -pipe -Wno-multichar -Wno-write-strings -fno-enforce-eh-specs -Wno-non-template-friend -g -pg -O3 -ffast-math -march=native -fno-omit-frame-pointer

Frame Pointer

You’ll notice I used the unfortunately-named -fno-omit-frame-pointer. This tells the compiler to maintain the frame pointer in the frame pointer register (rbp on x86_64 systems); the compiler might otherwise reuse that register as a general-purpose register, but we’re going to tell the perf tool to use the frame pointer register for building analyses, so we need to keep it around.
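
A quick (if crude) way to check that the frame pointer survived is to look for the classic rbp push at the top of a known function - a heuristic, not a guarantee:

# A function that maintains its frame pointer usually begins with
# "push %rbp; mov %rsp,%rbp".
; objdump -d ./unix/povray | grep -m1 -A2 '<main>:'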

Once we have the app built, we can run the standard benchmark (this takes a while):

; make -j `nproc` install
; ./unix/povray --benchmark </dev/null
...
Render Options
  Quality:  9
  Bounding boxes.......On   Bounding threshold: 3
  Antialiasing.........On  (Method 1, Threshold 0.300, Depth 3, Jitter 0.30,
 Gamma 2.50)
==== [Rendering...] ========================================================
Rendered 15360 of 262144 pixels (5%)
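
From here, profiling the benchmark is the same perf record invocation as before:

# Profile the standard POV-Ray benchmark with frame-pointer callgraphs.
; perf record --call-graph fp -F 99 \
    -- ./unix/povray --benchmark </dev/null
; perf report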

Further Reading

Truly, read the manpages. The perf man pages could be more thorough and some commands are not exceptionally well-documented (looking at you, perf diff), but they are invaluable resources.

Search for Brendan Gregg on YouTube; he has plenty of great talks there. For example: Give me 15 minutes and I’ll change your view of Linux tracing

References