# Tony Albrecht

## Optimisation Lesson 1: Profiling

Posted on

The performance of a game is critical – it defines the scope of a game. It specifies how many, how far, how much detail, the level of freedom, even the length of the project. Whether you are an artist, designer, producer or programmer, the performance of a game has an impact on you. As Jeff Atwood recently stated, “Performance is a feature” and as with all features, they need to be scheduled, researched, implemented and maintained.

This article is the first in a series based on the Masterclasses which I’ve recently presented for Game Connection, and will walk you through several phases of optimisation performed on a simple piece of code. It is primarily aimed intermediate programmers  or experienced programmers who haven’t had much to do with performance and low level programming. The aim is to give a detailed example of finding performance bottlenecks and then addressing them. It should give you and idea of how code can be optimised and the magnitude of performance improvements you can expect from different changes. The series of lessons will cover profiling, branch removal, cache aware programming, loop unrolling and SIMD programming.

## Introducing A Simple Particle System

The first piece of code we’re going to look at is a very simple particle system. We will look at the update phase only, ignoring the rendering and GPU performance for this example considering only the CPU for now. The CPU in this case is the PlayStation®3′s PPU and external profiling is done with SN System’s excellent Tuner.

There are a couple of simple classes, the vector class  MyVector and the particle class, MyParticle.

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 class MyVector { public: MyVector(){}; MyVector(float X, float Y, float Z) {x=X;y=Y;z=Z;}   inline const MyVector& operator+=( const MyVector& rhs ); float x,y,z; };   inline const MyVector &MyVector::operator+=( const MyVector& rhs ) { x += rhs.x; y += rhs.y; z += rhs.z; return *this; }   inline MyVector operator+( const MyVector& lhs, const MyVector& rhs ) { return MyVector( lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z ); }   inline MyVector operator*( const MyVector& lhs, const MyVector& rhs ) { return MyVector( lhs.x * rhs.x, lhs.y * rhs.y, lhs.z * rhs.z ); }   inline MyVector operator*( const MyVector& lhs, float rhs ) { return MyVector( lhs.x * rhs, lhs.y * rhs, lhs.z * rhs); } 

and the particle class;

 1 2 3 4 5 struct MyParticle { MyVector m_Position; MyVector m_Velocity; }; 

The particles in this system will accelerate under gravity and will bounce on the y equals zero plane with some damping applied to approximate energy lost in collision. If we can remember back to our high school maths, the equations of motion are;

${v}' = v + at$

${p}' = p + vt + \frac{1}{2} at^{2}$

So, the update loop for the code is:

 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 void ParticleManager::Update(float fClock) { int iParticle; const MyVector dampVelocity(0.75f,-0.5f,0.75f); const MyVector v3Acceleration=MyVector(0, -9.8, 0) * fClock; // a*dt   for (iParticle=0; iParticle<m_NumParticles; iParticle++) { MyParticle &p=m_Particles[iParticle]; // v' = v + a*dt p.m_Velocity += v3Acceleration;   // Update the particle position // p'=p + v * t + 0.5 * a * dt^2 p.m_Position += p.m_Velocity*fClock + v3Acceleration*fClock*0.5f;   // if Particle is below the ground plane, then damp and reverse the velocity if (p.m_Position.y<0) { p.m_Position.y*= -1; // reflect position along y p.m_Velocity= p.m_Velocity*dampVelocity; } } } 

This code gives us the following output (for 256k particles)

A simple particle system in action.

## Step 1: Profile

The first thing we need to do is figure how fast this code is running, and for that we need a useful metric. Frames per second is borderline useless and a percentage of a frame is just as bad. Use a sensible metric like milliseconds or microseconds – then you can then accurately measure the length of a function independent of the frame rate. Also, if possible, capture profiling information over a number of frames. This will allow you to find spikes and patterns in the execution of your code. When optimising code, the spikes are generally what need to be addressed first (and are often the easiest to fix too).

Consider the following graph of this particle system. The vertical axis is milliseconds taken to process all of the particles and the horizontal axis is time (# frames)

Graph of execution time taken over frame number.

Note the shape of the graph – it starts at around 17ms per frame and then dramatically sweeps up to sitting at about 60ms per frame. There is obviously something interesting happening here that requires further examination. Lets run a profiler over this code and see if we can get a better idea of what is happening with this example.

There are two main types of profiling that are generally used; instrumented and sampling.

### Instrumented Profiling

Instrumented profiling relies on either manual or automatic insertion of code that stores a time stamp at set points for later analysis. Manual instrumentation relies on the programmer inserting code which wraps the code to be timed, for example, at the start and ends of a function.

 73 74 75 76 77 78 79 80 81 void FunctionThatIsTakingALongTime() { PROFILE_START("Function Name"); for(int i=0; i<BigNumber;i++) { Do_Something_Expensive(slowly); } PROFILE_STOP(); } 

Manual instrumentation allows a programmer to store and use the timing information in any way that they see fit - on-screen profile bars, graphs or possibly profile data output to file for analysis in a separate application. It is the most versatile method of profiling, but care must be taken that the actual implementation of the profiling stubs aren’t too intrusive, slowing down the execution of the code being profiled. The graph above was produced using manual instrumentation.

Automatic instrumentation is performed by most modern profilers, and will allow runtime insertion of start/stop profiling stubs into code. This method is very useful, allowing a programmer to change which functions to instrument at run time, but relies on the profiling data being collected and displayed by an external application.

Another form of automatic instrumentation is call profiling which relies on the compiler inserting pre- and post-amble profiling code to every function. This method of profiling gives 100% coverage but is very intrusive (it slows down the running of the profiled code significantly).

With both types of instrumented profiling, the more instrumentation that occurs, the more intrusive it is and the slower the code will run. Instrumenting small, high frequency functions can artificially bloat the time taken in those functions and their parents, so care must be taken not to instrument too much.

Example of high frequency instrumentation slowing performance.

In the above diagram, each coloured bar corresponds to a function call, with the top ones calling the ones directly below them and so on, down to the fine grain leaf functions at the bottom. While this visual representation of the code path is useful for understanding the flow of code, it does slow down parts of the code non-uniformly and isn’t necessarily an accurate depiction of the relative performance of the profiled functions.

### Program Counter Sampling

This form of profiling requires an external profiler which samples the executing code’s program counter (PC) at set intervals over many frames, collecting these values for later analysis. The Program Counter is a CPU register which stores the current address of execution. Each PC value and the time that it was taken at can be reconstructed to determine which functions were being executed at which time. From this data we can see the overall execution footprint of the executing code – slow code will have more samples attributed to it while the fast code will have less.

PC Sampling Example in SN Tuner

In the above example, we can see that memory allocation is taking a long time and should be investigated further. The annotation has been added manually.

PC sampling, as with instrumentation, can be intrusive and affect the execution speed of the code. The higher the frequency of sampling, the more intrusive it is (although, in this case it has a uniform impact across the entire frame) , so it is advisable to minimise the sampling frequency and collect over many frames instead to get a more indicative performance measurement unless you explicitly need high frequency information for some reason.

### Profiling the Particle System

So, back to the problem at hand; we profiled the particle system using PC sampling and the profiler  attributed the samples to the following lines of code:

ParticleManager::Update() samples

The first column is the total number of samples at that PC, while the second column is the number of samples for that frame. Basically, the larger the number attributed to a line, the slower that line is to execute. The problem is that the PC samples correspond to particular assembly instructions and the application is responsible for determining which lines of C++ source those instructions are part of. This is a non-trivial problem (See the interactive example at the end of this lesson), as optimised code will generally move instructions around into an order which provides the best performance, so what we are left with is an approximation of where the slowest lines of code are. At the function granularity these samples are accurate, just not at the level of C++ instructions.

So, in order to accurately see what is taking the most time within a function, we must look at the disassembled instructions for this code fragment. I know what you’re thinking. You’re thinking

“Assembly!? I can’t read that. Only really old guys and freaks can read that!”

But don’t worry, it isn’t necessary to be fluent in the asm of your platform of choice, but it is important that you are able to understand the gist of the disassembled code. It’s like being in a foreign country – you don’t need be conversational (although that is ideal) but it will help immensely of you know some of the basics; the “please”, “thank you” and “Excuse me, where’s the bathroom?” – the phrases you use all the time.

For this lesson, I’ll not attempt to teach any assembly. I’ll leave that as an exercise for the reader  - it should suffice for this PPU based example to know that instructions starting with an ‘l‘ are load instructions, and those starting with ‘s‘ are stores. Instructions starting with an  ’f‘ are floating point operations – the full mnemonic is generally enough to guess the rest. If you want excruciating detail, the PPC books can be found here.

The next image shows the PC samples attributed to the individual assembly instructions. This tells us accurately which instructions are taking the longest to execute.

The numbers on the left are the samples attributed to that line of assembly, and this tells exactly what code we should be looking at. So, how do these instructions relate to the C++ code?  Well, I’ve coded up a simple interactive page which will allow you to step through the C++ of the inner loop and its generated assembly instructions interactively (as long as you have a HTML5 compliant browser that is – otherwise you’ll just get a blank page. In which case you need to go here). The C++ code being executed is highlighted on the right and the corresponding ASM on the left is shown so you can see what the code compiles into. Take note of the fact that there isn’t a direct linear relationship between the C++ and the ASM, the generated ASM can be quite out of order (even in this simple example). The blue box in the bottom right give a description of what is happening for that line.

HTML 5 compliant browser? Click here to step through the code

And so ends this first lesson. For the next lesson we’ll be looking at the hotspots in the generated assembly and figuring out why they are slow and how we can address them.

0

I've been a professional game developer since 2000, specialising in the hard core, low level, highly technical programming that is required to produce games that keep getting bigger and better. I love writing well specified, high performance code and rebuilding existing systems to function at the highest levels of performance. I take pride in understanding how the hardware works at the lowest levels so that I can eke out the best performance at the higher levels.