Playing around with perf

A few days ago I stumbled upon the Linux perf tool; this article stems from the curiosity about such tool.

Perf is a profiling tool for Linux: it counts the occurrences of several CPU’s and kernel’s events. Perf can be used in several modes; for example:

perf stat counts the events happened during the execution of a program;
perf record saves the counters of the events happened during the execution of a program;
perf report shows a report of the event counters previously saved with perf record;
perf top shows the event counters of the whole system in real-time.

We will use perf stat in order to acquaint ourselves with some of the events monitored by perf. We’ll proceed in an exploratory manner since each counter is a kind of Pandora’s box; since we’re at it, we’ll also use perf to compare a program that does the same task written in different programming languages.

Testing environment

Our test program will be the infamous recursive factorial! If you’ve landed in this page looking for an efficient factorial implementation you’re better off looking elsewhere.

The test program will calculate the factorial of the integer numbers from 1 to 20 and print it on the screen; this will be repeated for 10 times.

The results written in this article come from tests run on the following machine:

Intel i7-2670-QM CPU (4 core, 2.20GHz)
6GB of RAM
Linux 3.4.6 x86_64 (CONFIG_PREEMPT=y)

Pretty oversized for the following tests, indeed.

C

The first implementation of our test program is written in C. We will use the unsigned long long type: the C99 standard guarantees its size to be at least 64 bit. To print variables of that type we need to use the %lld modifier in the printf function format specifier.

Here’s the source code:

#include <stdio.h>

unsigned long long fact(unsigned long long n) {
    if (n <= 1) return 1;
    else return n * fact(n - 1);
}

int main(int argc, char* argv[]) {
    unsigned long long x, i;
    for (x = 0; x < 10; x++) {
        for (i = 1; i <= 20; i++) {
            printf("%lld\n", fact(i));
        }
    }
    return 0;
}

In the best C tradition, we must compile it:

$ gcc --std=c99 -O3 -o pippo pippo.c

And, finally, we can use perf:

$ perf stat ./pippo

The results:

Performance counter stats for './pippo':

 1.220809 task-clock                #    0.528 CPUs utilized          
      111 context-switches          #    0.091 M/sec                  
        0 CPU-migrations            #    0.000 K/sec                  
      128 page-faults               #    0.105 M/sec                  
1,562,075 cycles                    #    1.280 GHz                     [98.69%]
1,383,544 stalled-cycles-frontend   #   88.57% frontend cycles idle   
1,055,997 stalled-cycles-backend    #   67.60% backend  cycles idle   
1,764,632 instructions              #    1.13  insns per cycle        
                                    #    0.78  stalled cycles per insn
  314,855 branches                  #  257.907 M/sec                  
    8,144 branch-misses             #    2.59% of all branches         [23.56%]

       0.002313600 seconds time elapsed

Nice! And it’s only a few default counters: with perf list you can see all the goodies you can measure with perf. For now let’s just take a look at these default parameters:

task-clock

The program execution time, in milliseconds.

context-switches

When the Linux scheduler decides to run another program or when an interrupt triggers a routine’s execution, Linux steals a CPU from another running program (unless you got a free CPU). The “theft” involves saving the previous program CPU register’s content, in order to restore it later, and setting up them for the new program’s execution. This is context switching, and it happened for 111 times during the 1.2 milliseconds of our toy program’s execution!

CPU-migrations

On multi-processor systems Linux tries to keep the workload balanced among the available CPUs. Linux does this balancing both periodically and each time a CPU’s queue is empty. During our insignificant test there was no CPU migration.

page-faults

Each program has a virtual memory slice containing all of its instructions and data; usually it’s not efficient or even possible to copy all of it in RAM at once: a page fault is an event happening each time a program needs part of its virtual memory’s content to be copied in the physical memory. A page fault is signaled by the CPU via an interrupt (in Intel’s CPUs it’s the interrupt 14); such interrupt causes the execution of the page fault handling routine, that loads the desired content in the physical memory. When you run a program in Linux only the first part of its code is copied in RAM: the rest is brought into it on-demand with the page fault mechanism.

cycles

The quintessential CPU work, simplifying a bit, is:

fetch the next instruction to execute from the memory;
decode it;
run it.

This is known as the fetch-decode-execute cycle. Our test CPU runs at 2.20GHz, meaning that it can run this cycle 2.20 billion times per second. In practice, real CPUs split the fetch-decode-execution process in more than 3 phases. Thanks to techniques like pipelines, CPUs can run different phases of the fetch-decode-execute process for several instructions in a single cycle.

stalled-cycles

A CPU cycle is “stalled” when the pipeline doesn’t advance during it.

The Intel i7 pipeline is composed by many stages: the front-end is a group of these stages which is responsible for the fetch and decode phases, while the back-end executes the instructions. There is a buffer between front-end and back-end, so when the former is stalled the latter can still have some work to do.

Our toy program has kept the CPU idle for most of its execution time, shame on us!

instructions

This test caused the CPU to run 1.7 million of instructions in 1.5 million of cycles: thanks to pipelining we were able to run more than 1 instruction per cycle.

branches

Our test program contains a big showy branch: the recursive procedure checks for the base case every time.

branch-misses

Every time a program has a conditional instruction, the CPU tries to predict which side of the branch will be running even before evaluating the branching condition. This is done by statistical means and it’s useful to keep the pipeline filled: if the CPU guesses the correct branch, some of its instructions will already be halfway through the pipeline! This process, called “speculative execution”, was quite successful in our case: the branch predictor was wrong only 2.59% of the time, which was to be expected because our conditional instruction ended up almost always on the recursive branch.

Branch prediction can cause some strange bottlenecks, beware!

Python

Now we’ll compare the results of the C test with a Python implementation. Python 2 or 3? Both, of course!

The source code is pretty simple, we only have to take some care about the print incompatibility between Python 2 and 3:

from __future__ import print_function

def fact(n):
    if n <= 1:
        return 1
    else:
        return n * fact(n -1)

if __name__ == '__main__':
    for x in range(10):
        for i in range(1, 21):
            print(fact(i))

Now let’s profile it, with Python 2.7.3 and Python 3.2.3:

$ perf stat python2 pippo.py
$ perf stat python3 pippo.py

The results, for both the tests:

Performance counter stats for 'python2 pippo.py':

 23.599383 task-clock                #    0.884 CPUs utilized          
       404 context-switches          #    0.017 M/sec                  
         0 CPU-migrations            #    0.000 K/sec                  
     1,199 page-faults               #    0.051 M/sec                  
62,253,084 cycles                    #    2.638 GHz                     [77.12%]
24,284,879 stalled-cycles-frontend   #   39.01% frontend cycles idle    [85.56%]
18,800,600 stalled-cycles-backend    #   30.20% backend  cycles idle    [72.59%]
54,571,693 instructions              #    0.88  insns per cycle        
                                     #    0.45  stalled cycles per insn [86.29%]
12,050,167 branches                  #  510.614 M/sec                   [86.33%]
   450,304 branch-misses             #    3.74% of all branches         [83.27%]

0.026681088 seconds time elapsed

Performance counter stats for 'python3 pippo.py':

 51.270990 task-clock                #    0.495 CPUs utilized          
       413 context-switches          #    0.008 M/sec                  
         1 CPU-migrations            #    0.020 K/sec                  
     1,890 page-faults               #    0.037 M/sec                  
144,404,306 cycles                   #    2.816 GHz                     [76.52%]
69,657,055 stalled-cycles-frontend   #   48.24% frontend cycles idle    [82.24%]
46,733,342 stalled-cycles-backend    #   32.36% backend  cycles idle    [73.16%]
158,878,795 instructions             #    1.10  insns per cycle        
                                     #    0.44  stalled cycles per insn [87.22%]
38,254,589 branches                  #  746.125 M/sec                   [86.30%]
   945,615 branch-misses             #    2.47% of all branches         [85.28%]

0.103585508 seconds time elapsed

The test running on the CPython interpreter is significantly slower than the compiled C implementation; the additional overhead is evident both in CPU cycles terms and in page-faults terms (the interpreter executable is way bigger than our previous compiled executable).

There’s a lot happening under the hood of the interpreter: you can see it by the increased instruction numbers. At least the branch predictor was still pretty efficient. 🙂

Despite of the toy size of this example you can clearly see how the Python 3 implementation is twice as slower than Python 2. At least the Python 3 version seems to fill up the pipeline nicely (1.1 instructions per cycle), compared to Python 2 (less than 1 instruction per cycle).

Java

Let’s do the test with Java as well, since I happen to work daily with it.

public class Pippo {
    public static long fact(long n) {
        if (n <= 1) return 1;
        else return n * fact(n - 1);
    }

    public static void main(String[] args) {
        for (int x = 0; x < 100; x++) {
            for (long i = 1; i <= 20; i++) {
                System.out.println(fact(i));
            }
        }
    }
}

Yay, we got objects! 😛

Let’s compile and profile it (I’m using OpenJDK 1.7.0_05):

$ javac Pippo.java
$ perf stat java Pippo

The results are pretty interesting:

Performance counter stats for 'java Pippo':

166.823437 task-clock                 #    0.667 CPUs utilized          
     4,197 context-switches           #    0.025 M/sec                  
        46 CPU-migrations             #    0.276 K/sec                  
     4,086 page-faults                #    0.024 M/sec                  
451,866,791 cycles                    #    2.709 GHz                     [84.04%]
259,140,027 stalled-cycles-frontend   #   57.35% frontend cycles idle    [84.25%]
194,626,698 stalled-cycles-backend    #   43.07% backend  cycles idle    [66.68%]
361,661,446 instructions              #    0.80  insns per cycle        
                                      #    0.72  stalled cycles per insn [83.54%]
63,266,448 branches                   #  379.242 M/sec                   [81.27%]
 4,169,005 branch-misses              #    6.59% of all branches         [84.46%]

0.249978804 seconds time elapsed

In this case we’re even slower than Python. Look, the process was migrated between CPUs 46 times! The task was so trivial that probably most of the overhead was due to the infamous JVM start-up time.

Wrapping up

With perf you can profile every executable you want, even pre-compiled ones of course. You can also profile down to the function call level in order to find some candidates for optimizations. It’s really worth spending some time to get familiar with it.