Overbyte - Programming Category
http://www.overbyte.com.au/index.php/overbyte-blog/categories/listings/programming

Vessel Post Mortem: Part 1
http://www.overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-1

I started looking at Vessel in January 2013 - initially just in my evenings, in my 'spare' time. I was looking at it because John and Martin (the founders of Strange Loop Games and the creators of Vessel) had asked me to consider porting it to PS3. We knew each other from our time together at Pandemic Studios in Brisbane, where I had worked closely with Martin in the Engine team while John worked as an AI programmer. I was keen to do the port, but I wanted to be as sure as possible that I knew how much work was involved before committing.

I found the idea of porting such a game daunting - it had a custom rendering engine, a bespoke fluid physics system with major parts of the game code written in Lua - but I figured it was time to step up and take on such a challenge. I have an immense amount of respect for John and Martin - they are both incredibly smart coders and I relished the chance of working with them again. I wanted to do it but I didn't want to let them down. They were going to be my clients, but they were also friends.

I started by looking at their code - in order to give a quote on the porting process I needed to have the game running so that I could profile it. Vessel had already been partially ported to the PS3 by a third party, but wasn't compiling when I first got my hands on it. So I spent a few weeks of long evenings fixing the FileIO, the template compilation errors (GCC and MSVC disagree on some of the finer points of templates), and just commenting out anything I didn't think was relevant in order to get the game up and running. Once it was working, I ran my trusty profiler over the game and tried to guess how much time it would take to get this baby to a shippable state. What I found was a little scary.

Visualisation of the code changes made during the porting of Vessel to PS3.

Vessel runs predominantly on three threads. The game thread runs all the Lua scripts which control the majority of the game logic and in-game objects. The render thread takes the output of the game thread and converts it to a format that can be fed to the SPUs, which then build the graphics command buffer and feed it to the GPU in parallel. The physics thread manages all the physics processing - fluid dynamics, rigid body collisions, everything. John had built the physics system himself and it was pretty impressive (and complicated). It was too much to grok initially, so all I could do was look at the game's performance on some of the worst levels, profile it there and see how much I needed to speed it up.

I investigated a few of the levels on the PS3 build and picked two that I thought were the worst. What they told me was that the physics thread was the primary cause of slowdown and so I concentrated my analysis on that thread. The game was running at about 60ms per frame in the worst levels I was testing - about 15 to 20 frames per second. From discussions with Strange Loop, I knew that the physics thread had to run at 60fps so this meant that I needed a factor of 4 speed up. From experience, I knew that this was something that was achievable. The game thread was running at around 30ms per frame and the Render thread was clocking in at around 5ms per frame. SPU rendering was pretty fast, and I figured that wouldn't be much of a problem.

Months of fannying about took place and a contract was finally signed in June. So I did the only appropriate thing one could do in that situation - I did two weeks of work on Vessel and then went on a 6 week family holiday to the UK.

In anticipation of the Vessel contract I’d hired an experienced programmer, Ben Driehuis. Ben came highly recommended and had worked on the Bioshock series of games, plus, most importantly, he was living in the same city as me. Having spent the last 6 years working remotely I knew that my best bet of delivering this game on time was to be in the same physical space as my team. I threw Ben in the deep end. He was hired and then given work on platforms and engines that he’d never had to deal with before. Then when I buggered off to the UK he had to manage by himself, dealing with a performance contract on a big name title and then kicking on with Vessel in my absence. Poor bugger.

I also had a friend who was interested in the QA side of things, so James (an old Uni mate) was responsible for playing Vessel over and over and over again, reporting bugs to us and telling us that the performance improvements we’d just made hadn't made any difference. James also helped us out on the testing for the TRC (Technical Requirements Checklist - a massive list of tests that need to be checked off in order to get your game through submission to Sony) and the actual submission of the game.

Once I got back from my holiday we moved into the new office and started working in earnest on the Vessel port. Our goal was to finish Vessel by the end of August - we had about four man-months to do the work, and I had carefully calculated how much time each section of the port would take. Surely that was enough time?

Ben and I divided up the work between us: I took on the optimisation of the physics and Ben was to deal with pretty much everything else - audio was broken, some of the in game objects didn't animate, loading needed to be optimised, memory usage had to be reduced, we needed save games and trophies and a whole swathe of PS3 specific features implemented. Ben has written about some of the fun he had during the port here, here and here.

From my initial inspection of the code, I had noticed a lot of memory allocations taking place, so I was optimistic that I could reduce (or remove) these and scrape back a lot of frame time. I also figured that rearranging the memory usage to be more cache friendly would help the PlayStation 3’s main processor so I could leave the core of the physics code alone. The last thing I wanted to do was to port all of the physics to SPU.

Three weeks later I started porting the physics to SPU. The changes I was making were incrementally improving the performance, but I could see that we’d never hit the speed we needed without dramatic changes. Porting code to SPU isn’t like porting code from one platform to another. You can program SPUs in C++, but they have very limited memory available to them - only 256kb each. So, in order to port any significantly complex task, you need to understand how that task uses memory - where it is, what order it reads it, where it writes it - the things that most programmers of high level languages take for granted. Not only that, but you have to consider how the task will parallelise. Running on one SPU is good, but running on 6 is awesome.  (I’ll go into more detail on the SPU porting process in later articles.)
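The mechanics of that style of code can be sketched in ordinary C++. This is purely illustrative - the chunk size and the scale operation are made up, and real SPU code would use asynchronous DMA rather than memcpy - but it shows the shape: stream data through a small scratch buffer that stands in for the 256KB local store.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Pretend local store scratch buffer: SPU-style code must process data
// in chunks that fit alongside the code and stack in 256KB.
const std::size_t kChunkBytes = 16 * 1024;
const std::size_t kChunkFloats = kChunkBytes / sizeof(float);

// Scale every value in 'data'. On a real SPU the memcpy calls would be
// DMA transfers, ideally double-buffered so the next chunk streams in
// while the current one is being processed.
void ScaleInChunks(float* data, std::size_t count, float scale)
{
    float scratch[kChunkFloats]; // stands in for local store
    for (std::size_t base = 0; base < count; base += kChunkFloats)
    {
        const std::size_t n = std::min(kChunkFloats, count - base);
        std::memcpy(scratch, data + base, n * sizeof(float)); // "DMA in"
        for (std::size_t i = 0; i < n; ++i)
            scratch[i] *= scale;
        std::memcpy(data + base, scratch, n * sizeof(float)); // "DMA out"
    }
}
```

Parallelising across six SPUs then largely amounts to handing each SPU a different range of chunks - which is exactly why you have to understand where the data lives and in what order it is touched.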

The first SPU task I built was one that looked for drops that were near a set of objects. My first shot at this shaved almost 2ms off the frame time - the function in question dropped from a peak of 2.4ms to 0.6ms! That meant that in order to hit 16ms per frame I just needed to do this 21 more times!

Still, Ben and I kept hammering away at the code. By the end of the first week in August I'd shaved 15ms off the physics thread, and yet that thread remained the bottleneck. Ben had freed up over 200MB of memory and we were close to running on a retail PS3. But we'd started to notice a few things: some of the levers and triggers in the game weren't working, fluid shaders weren't working properly, and we were getting spikes of over 200ms in some cases. The game still wasn't playable - not only were the levers broken, but the AI of the fluros seemed to be broken too (some would teleport away).

It was evident at this point that we weren't going to hit the end of August deadline. Other projects had taken some time from both Ben and myself, so we still had billable time available going into September, and I let Strange Loop know that things were going to take a little longer than I had expected. They were very understanding and comfortable with the release date pushing out a little more.

Things were progressing well though - by the end of August the physics thread was running at better than 33ms per frame - we were over halfway there! Save games were in place, loading was optimised, and trophies were functioning. Unfortunately, the game thread was slowing down - as Ben was fixing problems with the code and data, more things were being processed. For example, audio wasn’t working before and now there was a lot of audio being triggered so this obviously took more processing time.

It was the second week in September when I realised that I’d made a horribly flawed assumption about the game thread in Vessel. I discovered that it had to function in lock step with the physics thread and so it too had to run at 16ms per frame. Reducing the framerate of the physics wasn’t an option as it was fixed to 16ms per frame - experimentation with varying that resulted in fluid collisions breaking horribly.

Then Ben discovered the reason for the teleporting fluros - a compiler bug, triggered only in optimised builds, was causing fluros to incorrectly calculate their trajectory during jumps, making them disappear. Turning off optimisation for that function fixed the problem. That was a tricky bug to find, so we were pretty stoked - until we realised that this meant the number of fluros in view was now higher than it was before. Much higher. More fluros means more processing - more physics, more drops, more rendering, more audio. This one fix resulted in the physics thread jumping back up to 50ms from around 24ms, the render thread jumping up to 12ms and the game thread climbing to 40ms.

I was horrified. We were effectively back to where we had started. We now needed to save a total of 60ms over three threads. I wasn't sure we could do it. It was like that bit in horror movies where the protagonist has killed the evil monster, only to have it rise up, stronger than it was before.

The initial optimisations on a project are the easiest. As time goes on and all the low hanging fruit has been picked, the remaining optimisations are harder and harder to perform. We really had our work cut out for us - and we'd come too far to go back now.

Visualisation of the code and data changes made for the Vessel PS3 port.

tony@overbyte.com.au (Tony Albrecht) Programming Thu, 27 Feb 2014 23:11:19 +1030
Looking For a Good Sort
http://www.overbyte.com.au/index.php/overbyte-blog/entry/looking-for-a-good-sort

Vessel for PlayStation 3 has finally been submitted to Sony for approval, so I have a little more time to write up some of the optimisation challenges we had to overcome during the porting process. Well, OK, I don't really have any more time, but I'll try to write a bit more about it before bit rot sets in and I forget everything.

Vessel ran predominantly on three threads: one for physics, one for game logic and one for rendering. The render thread translated the output from the game thread into a command buffer that could be parsed by the SPUs, and as such was pretty short - we had allocated up to 3ms per frame for it, but unfortunately in certain places it was hitting 11ms.

Looking at a Tuner profile, we found the following:

In this case, the Render thread was taking around 6ms, with half of this taken up with the generation of the quads for each of the fluid mesh maps. Of that, 1 millisecond was spent in the STL sort function. This seemed a little excessive to me, so I looked closer. The red bars in the image above correspond to the following code:

with the following functors to do the gritty part of the sorting.

Basically, for each fluid mesh map (the outer loop code isn't shown, but the first code fragment is called once per FluidMeshMap), the above code sorts the FluidVertexes by water type (lava, water, glow goo etc.) and then by whether each one is an isolated drop (both edges in the g_Mesh are equal) or an edge (e1 != e2). Once that is done, this information is used to populate a vertex array (via calls to SetEdge()) which is in turn eventually passed off to the GPU.
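The code images haven't survived in this version of the post, but the original was roughly this shape - note the struct fields and functor here are my reconstruction for illustration, not the actual Vessel source:

```cpp
#include <algorithm>
#include <vector>

struct FluidVertex
{
    int sort;   // fluid type: lava, water, glow goo, ...
    int e1, e2; // mesh edges; e1 == e2 means an isolated drop
};

// Order by fluid type first, then edges (e1 != e2) ahead of drops.
struct FluidVertexLess
{
    bool operator()(const FluidVertex& a, const FluidVertex& b) const
    {
        if (a.sort != b.sort)
            return a.sort < b.sort;
        const bool aDrop = (a.e1 == a.e2);
        const bool bDrop = (b.e1 == b.e2);
        return !aDrop && bDrop; // an edge sorts before a drop
    }
};

void SortFluidVerts(std::vector<FluidVertex>& verts)
{
    std::sort(verts.begin(), verts.end(), FluidVertexLess());
}
```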

From the Tuner scan we can see that the blue parts - which correspond to PROFILE_SCOPE(sort) in the code, i.e. the standard sort() call - take up a fair amount of time. The question is: can we write a better sort than std::sort?

std::sort is already highly optimised and was written by smart people that have produced a generic solution that anyone can plug into their code and use without thinking too much about it. And that is exactly why we can write a faster sort than std::sort for this case - because we know exactly what we want. We have the input and we know the output we want, all we need to do is work out the fiddly transformation bit in the middle.

The output we desire is the FluidVerts ordered by sort and then by edge/drop type. The final order should look something like this:

I decided to base the solution loosely on the Radix sort. We do a first pass over the data to be sorted and count up how many of each fluid 'sort' there are. This is effectively a histogram of the different fluid types - we now know how many of each type there are, so we can set up a destination array the size of the sorted elements and calculate where in that contiguous array each 'sort' of fluid should go.

Our second pass steps through the source array and places the FluidVertex into the correct bucket (based on fluid ‘sort’), but adds it to the beginning of the bucket if it is an edge and at the end if it is a drop. And that’s it. Done. Two passes. A super fast, super simple sort that does just what we need.
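The two passes described above can be sketched as follows. This is a self-contained illustration with hypothetical names (FluidVertex's real fields and the fluid type count differ in Vessel):

```cpp
#include <cstddef>
#include <vector>

struct FluidVertex
{
    int sort;   // fluid type index: 0 .. kNumSorts-1
    int e1, e2; // e1 == e2 means an isolated drop, otherwise an edge
};

const int kNumSorts = 8; // assumed number of fluid types

// Two-pass sort: histogram the fluid types, then place each vertex
// directly into its bucket - edges fill from the front, drops from the back.
std::vector<FluidVertex> BucketSortFluidVerts(const std::vector<FluidVertex>& src)
{
    // Pass 1: count how many verts of each fluid 'sort' there are.
    int count[kNumSorts] = {0};
    for (std::size_t i = 0; i < src.size(); ++i)
        count[src[i].sort]++;

    // Compute where each bucket starts in the contiguous output array.
    int bucketStart[kNumSorts];
    int running = 0;
    for (int s = 0; s < kNumSorts; ++s)
    {
        bucketStart[s] = running;
        running += count[s];
    }

    // head = next edge slot (front of bucket), tail = next drop slot (back).
    int head[kNumSorts], tail[kNumSorts];
    for (int s = 0; s < kNumSorts; ++s)
    {
        head[s] = bucketStart[s];
        tail[s] = bucketStart[s] + count[s] - 1;
    }

    // Pass 2: scatter each vertex straight to its final position.
    std::vector<FluidVertex> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
    {
        const FluidVertex& v = src[i];
        if (v.e1 != v.e2)
            dst[head[v.sort]++] = v; // edge: front of its bucket
        else
            dst[tail[v.sort]--] = v; // drop: back of its bucket
    }
    return dst;
}
```

Note that unlike std::sort, every element moves exactly once, and the output is already grouped and contiguous.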

“So, how much faster is it then?” I hear you ask.

"About five times faster," I would respond. This sort runs in about 0.25ms compared to the original 1ms, but it also has the added benefit of moving the FluidVertex data into contiguous sorted arrays (the original just produced an index map that could be used to traverse the original array, vSorted). This is good because we can then easily optimise the second part of the FluidMeshMap loop, which copies the verts to a vertex buffer for rendering and also modifies the edges via calls to SetEdge() (for those that are interested, SetEdge() does some simple arithmetic and inconveniently calls Atan2() to set up the angle of the edge for the GPU to use). This contiguous data means fewer cache misses, but on PS3 it also means that we can easily pass it off to the SPUs to do the transform and copy for us while we process the next FluidMeshMap on the PPU. Nice huh?

The result is that the BuildFluidMeshmap section is reduced from 2.25ms down to 0.24ms - a saving of 2ms.

The key message to take away from this is that generic implementations of algorithms are great for getting your code up and running quickly, but don't expect them to be lightning fast in all cases. Think about what you are doing, reduce your problem to the simplest inputs and outputs, and then build a simple transformation that does just that. It will make you feel good on the inside. Simplicity is its own reward.

(For anyone interested in more sorting goodness, check out an old blog post of mine, A Question of Sorts.)

tony@overbyte.com.au (Tony Albrecht) Programming Thu, 30 Jan 2014 12:13:26 +1030
Vessel: Common Performance Issues
http://www.overbyte.com.au/index.php/overbyte-blog/entry/vessel-common-performance-issues

It is time for me to write my final blog on the adventure that was Vessel PS3, since my time at Overbyte is coming to an end. This project was tough, but it is one I am proud we conquered. There are a lot of interesting and cool optimisations I did on this project, but since I won't have time to write blogs about them all, I thought I would highlight a few of the common problems I found, with examples of things we did. I will leave some of the more interesting ones for Tony to cover later.

There were quite a few places in Vessel where we had to fix up the use of vectors, lists and other similar containers. Obviously you need to pick the right container for the job, by understanding what your code needs the container to do.

But even if you choose the right container, or a near enough one, there are still many performance traps you can quickly run into.

Let’s do a quick example. What is wrong with the following code?

    vector<PipeEffectDefinitionItem> defs = m_EffectDefinition.GetRef()->m_Definitions;
    for(int i=0; i<m_vfxInstances.size(); ++i)
    {
        PipeEffectVfxReference& vfx = m_vfxInstances[i];
        if(!vfx.m_vfxInstance.IsSet())
            continue;

Did anyone notice the problem? We are grabbing a vector of PipeEffectDefinitionItems from another object, but instead of taking a reference, we are copying it. This happens in the update code of a particular class, which can be called a few times a frame due to other interactions, and each call was making the same copy. Sometimes it is valid to copy a container, if you know the data might change or it is temporary. In this case, however, we are referencing static data, so changing the first line to vector<PipeEffectDefinitionItem>& means we don't need to allocate a whole lot of memory.
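The cost of that one missing ampersand is easy to demonstrate. In this toy version (the copy-counting Item type is made up purely for illustration), passing the container by value copies every element, while a reference costs nothing:

```cpp
#include <vector>

struct Item
{
    static int s_copies; // counts every element copy made
    int value;
    Item(int v = 0) : value(v) {}
    Item(const Item& o) : value(o.value) { ++s_copies; }
};
int Item::s_copies = 0;

// The bug: assigning the container to a value makes a full copy -
// one heap allocation plus one copy per element.
int CopiesWhenTakenByValue(const std::vector<Item>& defs)
{
    Item::s_copies = 0;
    std::vector<Item> local = defs;
    (void)local;
    return Item::s_copies;
}

// The fix: a const reference aliases the original - no allocation, no copies.
int CopiesWhenTakenByReference(const std::vector<Item>& defs)
{
    Item::s_copies = 0;
    const std::vector<Item>& local = defs;
    (void)local;
    return Item::s_copies;
}
```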

Memory allocation and deallocation was a huge problem for us, with all the threads fighting each other, and I have seen many code bases suffer from the same problem over my career. One thing we found ourselves doing often in Vessel was working out the size of a container before a loop and reserving that space, instead of just adding items inside the loop and causing lots of resizing.
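Sizing the container up front is usually a one-line fix. std::vector guarantees that, as long as you stay within the reserved capacity, no reallocation (and no element copying) occurs:

```cpp
#include <vector>

// Grows the vector one push at a time. Without the reserve() call,
// the vector can reallocate and copy its whole contents many times
// as it grows; with it, there is exactly one allocation.
std::vector<int> BuildWithReserve(int n)
{
    std::vector<int> out;
    out.reserve(n); // one allocation up front
    for (int i = 0; i < n; ++i)
        out.push_back(i);
    return out;
}
```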

Caching Data

Another common way to solve performance issues is to cache data. One of Vessel's biggest function costs on the game thread was a lookup which searched all persistable items in the game to see if they matched a HashID. The idea behind it was that, since some things are spawned at run time and some during load, they could safely reference each other, and this system would look the target up and let you access it if it existed. This was done throughout the code base every frame, and whilst the lookup itself was quite efficient, it was still expensive.

When dealing with a code base you don't really know, the hardest thing is learning why things are done the way they are. Tony's blog a few weeks ago talked about this. You cannot assume that the programmer before you was stupid. They wrote the code the way they did for a reason. Sometimes they just wrote it in the simplest way and didn't have the same needs or performance considerations you do. Sometimes the game and systems around it changed, which means either a system is adjusted or the system is used in a way that was never intended.

In this case it was clear that on PC the performance cost of looking the item up each time was minimal to nothing, whereas the effort required to track object lifetimes, cache and un-cache references, handle level transitions and everything else would have resulted in such an insignificant gain that it was never looked at. For PS3, however, we needed to cache these lookups! Here is the original call to GetRef.

And the cost of the function over a typical frame:

Caching this value required a few changes and a few assumptions. We knew that all things referenced by this system were persistable, so we could add the required callback functions for both the destruction of the persistable and the removal of the reference to it on that class. We also assumed that the constructor would only be used to create new items, not to re-assign existing ones, which meant we never had to check the cache in constructors. For each of our assumptions we had to write code to catch the cases where we were wrong, and adjust lots of other code around them. But in the end, what we were able to do is change GetRef to:
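The code image is missing from this version of the post, but the gist can be sketched as follows. All the names here are hypothetical (the real Vessel classes differ): the reference caches the pointer on the first lookup, and a destruction callback clears the cache so a stale pointer can never be returned.

```cpp
#include <unordered_map>

typedef unsigned int HashID;

struct Persistable { int dummy; }; // hypothetical base for lookup targets

// Hypothetical global registry of live persistable objects.
std::unordered_map<HashID, Persistable*> g_Persistables;

class PersistableRef
{
public:
    explicit PersistableRef(HashID id) : m_Id(id), m_Cached(0) {}

    Persistable* GetRef()
    {
        if (!m_Cached) // only pay for the hash lookup once
        {
            std::unordered_map<HashID, Persistable*>::iterator it =
                g_Persistables.find(m_Id);
            if (it != g_Persistables.end())
                m_Cached = it->second;
            // In the real fix, this is also where a destruction callback
            // would be registered so OnTargetDestroyed() gets called.
        }
        return m_Cached;
    }

    // Callback invoked when the target persistable is destroyed.
    void OnTargetDestroyed() { m_Cached = 0; }

private:
    HashID m_Id;
    Persistable* m_Cached;
};
```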

And now looking at its performance, we see a 0.5ms improvement.

Inefficient Data Searching

The first thing you should probably do in a code base when looking to optimise is to search for something like 'TODO: Make faster'. Chances are it is probably slow. Whilst I am kidding, I did find one such comment after noticing, via the hierarchy view, a function using a lot of time. This function was used to find both functions and properties so that Lua and the game code could talk to one another. Here are the pieces of code in question.

There are two inefficient things in this code. Firstly, we are doing a linear search through a vector to find the functors and properties, even though we never need to iterate through the whole list in any time-critical runtime code. Secondly, we are doing a case-insensitive string comparison, which isn't cheap either. Here is a summary of a profile of this code running in one of our levels.

The solution here is twofold. Firstly, we should be using an associative container, as we know that:

a) the container is only added to as a start-up cost, which sets up the links between the strings and the actual functors/properties;

b) the time-critical part of this function is a lookup based on a key;

c) we almost never iterate through all elements (except to find the key, as above).

Secondly, we know that the string compare is not fast. In a case like this you want to convert the string to a hash, which will be a lot faster. The Vessel code base already uses hashes instead of strings in most places, but this one was never changed. So we now have the following code:

The difference can be seen below - a small saving, but importantly the maximum is reduced, so there are fewer peak slowdowns. It is always good to review any functions which are called a lot, especially if they seem to be taking a lot of time in comparison to the work they should be doing.
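The code images are missing from this version of the post, but the combined fix can be sketched like this. The names, the FNV-1a hash and the use of an integer functor index are all illustrative - Vessel's actual code differs:

```cpp
#include <cctype>
#include <unordered_map>

typedef unsigned int Hash32;

// Case-insensitive FNV-1a, so "Update" and "update" map to the same key.
// Hashing once replaces a case-insensitive strcmp per candidate element.
Hash32 HashNameNoCase(const char* s)
{
    Hash32 h = 2166136261u;
    for (; *s; ++s)
    {
        h ^= (Hash32)std::tolower((unsigned char)*s);
        h *= 16777619u;
    }
    return h;
}

class FunctorTable
{
public:
    // Start-up cost only: called once per function when bindings are set up.
    void Register(const char* name, int functorIndex)
    {
        m_Table[HashNameNoCase(name)] = functorIndex;
    }

    // Time-critical path: one hash plus one O(1) average lookup, instead of
    // a linear scan with a case-insensitive string compare per element.
    // Returns -1 if the name is not registered.
    int Find(const char* name) const
    {
        std::unordered_map<Hash32, int>::const_iterator it =
            m_Table.find(HashNameNoCase(name));
        return it == m_Table.end() ? -1 : it->second;
    }

private:
    std::unordered_map<Hash32, int> m_Table;
};
```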

Doing things unnecessarily

If you don't understand a codebase - as was the case for us when we started Vessel - it can be extremely difficult to know what each function and system is trying to do, let alone stop it from doing things it doesn't need to do. Now that we are at the end of the project, we are pretty familiar with the code and have found many areas where it was doing unnecessary work, or was overly complicated.

For instance, Vessel has objects called water instances - one for each specification of liquid in the game, and there could be several specifications of Water or Lava. The problem was that the water instance code also managed sounds and lights for the liquids, based on flow, collisions and other information. This code would find all liquid of a given liquid type (Water, Lava, Red Goo), play or stop sounds and generate lights for it. As we could have up to 9 water instances of a single type, the same liquid was processed 9 times, and 9x the lights and sounds were being played.

Obviously the issue here was that originally water instances and water types were a 1-to-1 relationship, but as the game grew this moved apart, and as there was no performance impact on PC it was never revisited. On PS3 this was causing us huge issues: going through all the liquid in the game is slow, so doing it 9x more than necessary was bad. FMOD was also not appreciating the heavy load it was being put under, and the GPU was struggling to draw 9 lights in the same location. Here is the profile:

We simply cached whether we had already done the lighting and sound checks for a liquid type each frame; if we had, we didn't do them again. This did have some repercussions, as the content was all tweaked around the old behaviour, so we had to increase sound volumes, the intensity of lights and some other settings.

Further inside this class, it was handling collisions and keeping a list of data about the interactions between drops of the fluid. This list reused a pool of data, to avoid memory allocations and deallocations. The list was checked each frame to remove any records which were now obsolete, and then checked for room to add new records later.

This code was again quite slow, as it was going through 100 records for each of 35 water instances. Most of the time - since most water instances do not have liquid on screen - there were no records to clean up, but the code did not know this and checked each empty record anyway. So we made a minor optimisation: the object now knows whether it has any records in use, and if not, it does nothing. This way the code did nothing for the majority of the water instances.
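The shape of that early-out is trivial. Names here are hypothetical, and the obsolescence check is a placeholder:

```cpp
const int kMaxRecords = 100;

struct CollisionRecord
{
    bool inUse;
    // ... interaction data would live here ...
    CollisionRecord() : inUse(false) {}
};

struct WaterInstance
{
    CollisionRecord records[kMaxRecords];
    int activeRecords; // the fix: track how many records are actually in use

    WaterInstance() : activeRecords(0) {}

    // Placeholder: the real check looks at the interaction data.
    bool IsObsolete(const CollisionRecord&) const { return true; }

    void UpdateCollisionRecords()
    {
        // The fix: most instances have no liquid on screen, so do nothing
        // at all instead of scanning 100 empty records every frame.
        if (activeRecords == 0)
            return;

        for (int i = 0; i < kMaxRecords; ++i)
        {
            if (records[i].inUse && IsObsolete(records[i]))
            {
                records[i].inUse = false;
                --activeRecords;
            }
        }
    }
};
```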

This was a much smaller saving, but an indication of some of the really simple changes you can make to stop code doing things when it doesn't need to. Let's have a final look at the water collisions:

Anyway, I hope there was some interesting information in there for you. Thank you to everyone who took the time to read one of my blogs. Tony has heaps more in store for you, so watch this space!

ben@overbyte.com.au (Ben Driehuis) Programming Tue, 31 Dec 2013 09:33:04 +1030
Vessel: Where is the memory going?
http://www.overbyte.com.au/index.php/overbyte-blog/entry/vessel-where-is-the-memory-going

Going into Vessel, we (wrongly) assumed that memory usage wouldn't be much of an issue, at least in terms of peak memory usage and getting the game running on a console. We soon learnt that the initial build of Vessel we had could not even load the main menu, or any of our smallest levels, without running out of memory on a retail PS3.

The first step is to find out what memory we are actually using and where it is going. This might sound obvious and simple but, depending on your codebase and what engine and middleware you are using, there may not be any easy way to find out. As Vessel uses its own custom engine, written for a platform where memory isn't a great concern, we had no tools to start with.

The PS3 has two areas of memory: 256MB of main system XDR memory and 256MB of RSX memory. The two pools are completely separate, with the Cell CPU able to access the system memory quickly, whilst the GPU accesses the RSX memory quickly. Each processor can access the other area of memory, but at a performance cost. Some good information on this can be found a few paragraphs down here: http://www.playstationlifestyle.net/2013/02/27/ps4-memory-analyzed/

So we started trying to track down our usage. We used our RSX allocator to track current and peak memory usage, and used Sony API calls to track our system memory usage. What we soon learnt got us a little worried: we were way over budget on system memory (383MB used on our smallest level!) and running out of RSX memory on most levels.
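A minimal sketch of that kind of tracking allocator, assuming allocations can be routed through your own functions (the names are mine, not Vessel's; a real tracker would also record per-tag or per-callstack information):

```cpp
#include <cstdlib>

// Running totals for one heap.
static std::size_t g_CurrentBytes = 0;
static std::size_t g_PeakBytes = 0;

void* TrackedAlloc(std::size_t bytes)
{
    // Stash the request size in a small header so TrackedFree knows it.
    std::size_t* p = (std::size_t*)std::malloc(bytes + sizeof(std::size_t));
    if (!p)
        return 0;
    *p = bytes;
    g_CurrentBytes += bytes;
    if (g_CurrentBytes > g_PeakBytes)
        g_PeakBytes = g_CurrentBytes;
    return p + 1;
}

void TrackedFree(void* ptr)
{
    if (!ptr)
        return;
    std::size_t* p = (std::size_t*)ptr - 1;
    g_CurrentBytes -= *p;
    std::free(p);
}
```

On the system-memory side you can achieve the same thing without touching call sites by wrapping malloc/free at link time (GNU ld's --wrap option), which is also how Heap Inspector, mentioned below, can be hooked in.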

Middleware-wise, we use FMOD, Lua, Boost and TinyXML. FMOD has an easy built-in API to get memory usage, and surprisingly I learnt that it was using 180MB just a minute into our gameplay… Thankfully this was a case of bad data, as the PS3 FMOD project had not been set to stream the music! This reduced it to a more reasonable 80MB - still high, but since all sounds other than music are loaded into memory, it was at least what we expected. So where was the other 200MB of system memory going?

Whilst we had no internal memory tracking tools, thankfully the PS3 does have a third-party solution for tracking down memory called Heap Inspector. This tool was developed by Jelle van der Beek and is free for PC and PS3 developers. Setting it up is quite straightforward; it just requires you to wrap your memory allocation and deallocation calls. An easy way to do this, if possible, is to use the link-time wrapping feature (--wrap). You can find the Heap Inspector website here: http://www.heapinspector.com/

This gave us a very good rundown of exactly where the system memory was going. A big chunk of our memory was being used by TinyXML, which seemed strange. Since we were on quite an old version, we decided to upgrade (as there were both significant performance and memory improvements). As this involved a big API change, I had to rewrite most of the XML interface code, which actually led me straight to the major memory problem: after loading, we were not clearing out the document handle properly, which meant the file buffer was being kept in memory. Since this was all raw XML, this was between 15-20MB per level, and our global data was another 30-40MB.

So now we could load most of our levels, but we couldn't transfer between most of them. Further analysis with Heap Inspector showed us that FMOD memory was increasing as the game went on, and that a big chunk of our system memory was there to catch the RSX memory overflow. Again, the FMOD issue wasn't too hard to track down: originally Vessel didn't have to worry about memory, so it never unloaded any sound data, even if that sound was never played again. Whilst unloading a sound as soon as it finished playing would be bad and cost us performance (the sound would have to be loaded again the next time it was played), on level transfer it was safe to clear out all the FMOD event data. This meant it wouldn't grow as the game went on.

Now we needed to solve the RSX overflow and, looking at the tracking we had added ourselves, all of that overflow was due to texture data. Most games have texture streaming and a texture pool, but Vessel does not - it just loads all the level data. The levels are smallish, though, so we shouldn't have been using so much RSX! I soon discovered we were only overflowing on level transition, and that's when I worked out what was happening. On transition, we first loaded in the new level data, then went through the old data and unloaded anything that wasn't reused. This is a nice optimisation for load times when you have memory to spare, but we didn't have that luxury. Was there a way I could keep it in and not overrun our memory?

The good news was that the actual binary data and the data objects were separate, so what I did was change the system to queue up the loading of the actual textures, while still loading the normal data objects, which meant we could still compare whether we needed any of the texture and other asset data. Once we had unloaded all the unneeded data, we then loaded in the new textures. Now we never went over RSX memory and we could get rid of our overflow!

The final memory issue we had was that loading our largest level would sometimes run out of memory. This was because, again, it was using XML, and the way the serialization system worked was that it read the whole file in as one XML buffer, so we needed to allocate a buffer the size of the file. Not always possible! So we had a few choices: change the serialization system so it could stream the data in, or find a way to reduce the file size.

Since TinyXML supported both compressed and binary XML data, which greatly reduces the size (as well as letting us split the data over two files instead of one large one), we decided to do that first. This reduced our largest level from needing a 21MB buffer to needing a 7MB and a 3MB buffer. That was a lot more manageable and turned out to be more than enough for our needs. The nicer way to go would be to stream the data in and not need the buffer at all; however, in our case we needed the fastest and least intrusive option (since we needed the PC version to keep working so we could access the editor).

So those were the major problems we had with memory usage. A lot of the fixes are very specific to Vessel, but the key things to remember are:

1. You need to find exactly where the memory is going.
2. Use tools where possible - you don't always have to re-invent the wheel.
3. For middleware, check for API calls, check for updates and talk to their support.
4. Never, ever trust the content!

]]>
ben@overbyte.com.au (Ben Driehuis) Programming Fri, 13 Dec 2013 10:18:00 +1030
How Smart Programmers Write Stupid Code http://www.overbyte.com.au/index.php/overbyte-blog/entry/how-smart-programmers-write-stupid-code http://www.overbyte.com.au/index.php/overbyte-blog/entry/how-smart-programmers-write-stupid-code This article is based on the presentation I gave at GCAP 2013. You can find the slides in Keynote and Powerpoint formats here. The slides include extensive presenter notes, so you should be able to understand what I was getting at even though the slides are purposely light on content.

I've spent a significant part of my professional career working with other people’s code. I've worked with code from big studios and small ones, from successful ones to struggling ones, with experienced devs and green. I’ve seen code that is beautiful, code that inspires as well as code that makes no sense, code that stinks, code that is impossible. I've seen code you wouldn't believe.

But all the code I’ve seen, both good and bad, has one thing in common - it was written the way it is for a reason. Part of my job, and the job of any code maintainer, is to find that reason and understand why the code was written in that particular way.

When confronted with new, complex code, there are 4 stages that a programmer goes through:

• [horror] WTF is this?
• [confusion] How did this ever work?
• [blame] I hate you and I wish you would die.
• [enlightenment] Ahh, I get it now.

The fourth stage is the important one. Enlightenment. Comprehension. When you finally understand the code completely and it becomes your own. All too often, if the function of some code isn't obvious it is dismissed as being stupid - but this is rarely the case. The code in question isn't necessarily stupid, but it is obscure. Complex. Some of the smartest people I know are programmers, yet they can write some bloody awful code. Stupid coders don't last long in this industry, and their code doesn't last much longer than they do. So why do smart programmers write stupid code?

There are two ways of constructing a software design:
One way is to make it so simple that there are obviously no deficiencies
and the other way is to make it so complicated that there are no obvious deficiencies.

— C.A.R. Hoare, The 1980 ACM Turing Award Lecture

Time Constraints

Time constraints have a significant impact on the complexity of our codebases. There are two types of time constraint: the time available to write the code in the first place, and the time available for the code to execute in.

Time to get the code running

The first constraint is a very common one - game devs are time poor. We generally have only just enough time to get the code working in the first place - who has time to revisit their code just to clean it up? We've got a game to ship, right? Personally, I've seen some awful game engines ship great games - and conversely, great engines that have never shipped a game. So, back to my point: given limited time to develop the code, what is produced isn't always the cleanest or simplest solution. When that code is later revisited for bug fixing or some other modification, its opacity can mean that it won't be touched - rather, it will be worked around or removed completely, as the new coder may not understand why or how it was done.

Time for the code to run in.

The second constraint is one we've all seen too. A portion of code was recognised to be a bottleneck and so it has been optimised into something unrecognisable. In its optimised state it is difficult to debug, difficult to add to, difficult to understand.

Both these forms of time constraint can result in complex, hard to maintain code. ‘Stupid’ code. Code that wasn't necessarily written by stupid people, but taken at face value is at best, obtuse.

Code Entropy

Code is like a thermodynamic system. Over time it tends to a state of disorder. In fact, given enough time it tends to a state of perfect internal disorder (which is a very apt description of most shipped games’ codebases). Disorder within a codebase manifests as complexity (and instability) which, if not actively resisted, increases over time.

There are a couple of factors which contribute to this increase of complexity:

Problem Domain Evolution

The problem you end up solving is rarely the one you set out to solve.

Consider a programmer who sets out to solve a problem. She settles on an appropriate approach and an initial solution is implemented. At a later date the initial problem is deemed to be incorrect, or requires incremental modifications, and so she has to readdress it. The new direction is similar enough to the old one that it makes sense to simply modify the existing code - so she slowly twists the original solution to fit the new problem. It works; it's a little ugly, but it works. If this happens enough times, the final problem can be significantly different to the original one, and the initial design may not be as appropriate any more. Or even remotely appropriate. But it works.

The issue here is the evolution of the code. The programmer designed a clean system to solve the given problem, and the problem has changed, so she needs to modify - dirty - the system. Edge cases and singularities pop up which dirty the implementation further.

So what you have is an evolving system which gets more and more complicated as the problem domain changes. This becomes even more evident over longer periods of time, leaving you with dead code and data. As new programmers are swapped in and out to iterate on the code, things get progressively worse.

Solution Evolution

Iterating without readdressing the initial design convolutes the eventual solution.

A programmer starts with a given problem. She designs a system to solve this problem and then implements that solution and tests it to see if it solves the problem. If it does, then the problem is solved, the code is checked in and beer is consumed in a celebratory manner. If not, then the implementation is revisited and iterated upon until a working solution has been achieved (possibly while everyone else is at the pub having beers).

The issue here is that the initial design may not have been the most appropriate for this problem - the programmer learns more about the problem as she implements and iterates on it, but rarely revisits the initial approach to the problem. The more work that is put into a particular approach, the less likely the programmer is to change it. The ideal time to design the code for this problem is once she’s finished implementing and testing it.

What we end up with is sort of like DNA - there is a working solution there, but there is also a lot of cruft left in there from other iterations which has no effect on execution (except for when it does). It’s hard to tell what is part of the core solution and what is cruft - so what happens is nothing is touched when changes need to be made. New code is added to the periphery of the solution, often duplicating what is already in that original solution.

Complexity has momentum

The problem with complexity is that it breeds more complexity. The more complex a solution is, the more complex modifications to it become, either through necessity (the complex implementation requires complex changes) or convenience (it's too hard to understand, so it's easier to tack something onto the existing solution, or even to rewrite what is there).

Controlling complexity is the essence of computer programming.
— Brian Kernighan

It takes a real effort to simplify a complex implementation - a coder needs to completely understand the existing solution before she can simplify it. The urge to rewrite a complex system is strong, but unless you fully understand the existing implementation you run the risk of making the same mistakes the original programmer(s) did.

Any intelligent fool can make things bigger, more complex, and more violent.
It takes a touch of genius — and a lot of courage — to move in the opposite direction.

— Einstein

In closing

My recommendations are this: treat existing code and previous coders with respect. In all likelihood the code is the way it is for a valid reason. Once you understand why it was written the way it was, you are in a position to modify, simplify or optimise it - and when you do, leave the code in a better state than you found it. Your future self will thank you.

Where possible, take the time to refactor your code once it is working. Simplify it, clean it, make it beautiful. Beautiful code is its own reward.

]]>
tony@overbyte.com.au (Tony Albrecht) Programming Thu, 21 Nov 2013 21:16:57 +1030
Vessel: Lua Optimization and Memory Usage http://www.overbyte.com.au/index.php/overbyte-blog/entry/vessel-lua-optimization-and-memory-usage http://www.overbyte.com.au/index.php/overbyte-blog/entry/vessel-lua-optimization-and-memory-usage Hi everyone, my name is Ben Driehuis and I am a senior programmer at Overbyte.

Vessel uses Lua for its glue gameplay code, building on the engine's systems and tech to form specific gameplay classes. When we first looked into the performance of Lua and Vessel in general, we noticed very quickly that all of the threads were doing a lot of memory allocations and frees each frame, which meant there was actually a lot of contention between the threads.

Lua has its own incremental garbage collector which runs automatically. The collector works in small steps, and how long it runs is more or less proportional to the amount of memory being allocated: the more memory you churn through in Lua, the more the garbage collector will run. You can also drive the collector manually in several ways, forcing a single step or a full collection (which is not cheap). A full collection is very handy in cases where the user won't notice an extra 10-30ms in a frame - good examples are the start/end of a level load, or showing a pause menu or some other game-pausing interface.
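The same step-versus-full-collect idea can be sketched in engine terms (a hedged sketch - `DeferredCleanup` and its method names are hypothetical, standing in for Lua's real `lua_gc(L, LUA_GCSTEP, n)` and `lua_gc(L, LUA_GCCOLLECT, 0)` calls):

```cpp
#include <deque>
#include <functional>
#include <utility>

// Amortise cleanup work across frames, with an explicit "do everything now"
// path for the moments when a long stall is acceptable (load screens, pause).
struct DeferredCleanup {
    std::deque<std::function<void()>> work;

    void add(std::function<void()> f) { work.push_back(std::move(f)); }

    // Per-frame: do a bounded amount of work, like lua_gc(L, LUA_GCSTEP, n).
    void step(int budget) {
        while (budget-- > 0 && !work.empty()) { work.front()(); work.pop_front(); }
    }
    // Full run, like lua_gc(L, LUA_GCCOLLECT, 0) - may take tens of milliseconds.
    void collectAll() {
        while (!work.empty()) { work.front()(); work.pop_front(); }
    }
};
```

The trade-off is identical to Lua's: small steps keep each frame cheap but let a backlog accumulate; the full run clears the backlog at a point where nobody is watching the frame time.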

On console, memory is always a huge factor, with most optimizations involving some trade-off between memory and performance - though in a lot of cases the trade-off is heavily skewed one way or the other. In Vessel, garbage collection was taking up a worrying amount of time. We were forcing one step of the GC per frame, after our gameplay thread had done the work needed for the physics thread to do its thing. The result in this particular scene was:

An average time of 2.6ms is pretty good, but the maximum of 7.4ms is a huge concern! As we learnt above, the step time is very dependent on the amount of new memory being used each frame in Lua. In addition to the manual GC, Lua is doing its normal background incremental garbage collection during execution. In the capture below you can see it working away during the game thread, indicated by the calls to propagatemark. (See Tony's last article, A Profiling Primer, for more info on Tuner.)

Looking at our markers, we quickly noticed that our processor Lua scripts were also quite expensive in this scene (the two pink bars in the above image). This marker covers the main place where the gameplay world objects that change each frame live, and so it is also where the most Lua is called. Let's have a look at the current script processor update performance. Looking at the image below, we can see it has an average of 7.1ms. That is not good at all!

We then started investigating what short-term allocations we were making inside Lua, to try and reduce our need for garbage collection. Our aim was to turn off the automatic background garbage collection and just run our own step at the end of each frame, with that step not taking too much time. To track down and display the memory we were allocating and freeing each frame, we used third party code from Matt Ellison, Technical Director at Big Ant Studios. A big thank you to Matt for allowing us to use it.
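Matt's tracker isn't public, but the core idea can be sketched with a custom allocator matching Lua 5.1's `lua_Alloc` contract (`nsize == 0` means free, otherwise reallocate; for a fresh allocation `ptr` is null and `osize` is 0). The `AllocStats` bookkeeping here is our own invention - it would be installed with `lua_newstate(trackingAlloc, &stats)`:

```cpp
#include <cstddef>
#include <cstdlib>

// Per-frame allocation/free counters; reset both fields at the end of each frame.
struct AllocStats {
    std::size_t allocatedThisFrame = 0;
    std::size_t freedThisFrame = 0;
};

// Matches the lua_Alloc signature: void* (*)(void* ud, void* ptr, size_t osize, size_t nsize).
void* trackingAlloc(void* ud, void* ptr, std::size_t osize, std::size_t nsize) {
    AllocStats* s = static_cast<AllocStats*>(ud);
    if (nsize == 0) {                       // Lua is freeing the block
        s->freedThisFrame += osize;
        std::free(ptr);
        return nullptr;
    }
    if (nsize > osize) s->allocatedThisFrame += nsize - osize;   // growth
    else               s->freedThisFrame    += osize - nsize;    // shrink
    return std::realloc(ptr, nsize);        // realloc(NULL, n) acts as malloc
}
```

Dumping `allocatedThisFrame` once per frame gives exactly the "30kb per frame in temporary Lua allocations" kind of number discussed below.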

So, in this scene we determined we had about 30kb per frame of temporary Lua allocations. That's a lot, and it quickly adds up! Looking at the output from Matt's tool, we saw that 20kb of that came from just one Lua script, Bellows.lua - and all of it inside its SetPercent function. The output showed a lot of temporary Vector2 construction, including a few kb from our Utils.lua VectorRotate function called inside it.

Let’s take a quick look at those two functions.

Well, I think we can now see where all that Vector2 construction is coming from! Inside that for loop we are creating three Vector2s per iteration, and the call to VectorRotate in the loop creates more - a total of five per iteration on each bellow.

We can also see some unnecessary calculations. Bellow.GetRotation + this.OffsetRotation is computed twice, without either of those values changing within the loop. We also know that GetRotation is a call into native code, which is more expensive than accessing a Lua member. The material scale is also calculated in each iteration, even though it never changes value. So instead of creating all of these Vector2s inside the for loop, let's use local variables and remove the call to VectorRotate. This gives us:

A good performance improvement on average. Looking at the temporary memory use of Lua, we now saw a maximum of 23kb a frame (a 7kb saving) and an average of 20kb. Our output no longer showed any Vector2 allocations, we had reduced our memory waste by just under 25% and, as we hoped, our Lua was running faster.

We still create some temporary memory each frame for each bellow and ring, which we can hoist into locals for the function. We can also store references to the maths functions as locals. This will give us a small speed up in performance.

Why is this? Because Lua gives each function up to 250 of its own "registers", and stores all local variables in these registers. For a non-local, it first has to look the data or function up and put it into a register before using it. If you are calling or creating something only once in a function, making it a local obviously won't help, as you still pay the one-time lookup cost. But in a loop, you save on every subsequent call! So let's run with the following code:

A huge performance win! Memory wise, though, we weren't seeing any big improvement - still a maximum of 22kb of temporary memory a frame and an average of 19kb. At this point I decided to find other areas heavy in bellow usage, to see if that would help narrow down where all the waste was.

This brought me to the start of one of our levels, where the camera zooms out over a largish area. This was showing an average of 20kb a frame, with a massive maximum of 38kb - and 17kb of that was again the bellows. The data showed lots of memory use by calls to GetMaterial and GetLocalRect. Performance wise, our ScriptOnProcessorUpdate was pretty close to the above image on average, at 4.33ms, but with a higher maximum of a massive 8.6ms and a lower minimum of 2ms.

Looking at the code we now have, we can see two interesting facts about the use of GetMaterial and GetLocalRect. Firstly, both are called on unique items once in the loop, so we can't cache them. Secondly, both are only used to then call a function on the returned object. That's two calls into our native code and a temporary allocation of an object pointer and a Rect. So what happens if we make two new functions in native code that let us do it in one call, without the temporary assignment?
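The shape of the change on the native side is something like this (a sketch with made-up types - Vessel's real Material and item interfaces will differ):

```cpp
struct Material {
    float scaleX = 1.0f;
    float scaleY = 1.0f;
};

struct Item {
    Material material;

    // Old path: script calls GetMaterial(), a temporary handle crosses the
    // Lua/native boundary, then a second native call is made on that handle.
    Material* GetMaterial() { return &material; }

    // New path: one condensed native call does both steps, so nothing
    // temporary is handed back to Lua at all.
    void SetMaterialScale(float sx, float sy) {
        material.scaleX = sx;
        material.scaleY = sy;
    }
};
```

Binding `SetMaterialScale` to Lua replaces two boundary crossings and a temporary object with a single call - exactly the "condensed function" rule summarised at the end of this post.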

Another good win, due to fewer calls between Lua and the native code and less temporary allocation. We now have an average of 14kb of temporary memory usage per frame and a maximum of 20kb for this new scene. A whole lot better! Our bellows here are now allocating a maximum of 9kb.

We can also remove the offsetVector: we know its X component is always 0, so we can cut down the position calculations. mathcos(bellowOffSetRotation) * offsetVector.x will always give a result of 0, so that mathcos is a waste of time. The position calculations can therefore be simplified to:

position.x = posBase.x - (mathsin(bellowOffSetRotation) * posOffset)
position.y = posBase.y + (mathcos(bellowOffSetRotation) * posOffset)

This doesn't save us much temporary memory - we are only saving one float - but it is still a slight help, and the performance is slightly better. At this stage we are getting pretty close to an optimised function without changing things significantly. So let's take another look at the performance of the garbage collection step at the end of each frame.

Our average is down by 0.5ms and the maximum is lower too, after dealing with just one of the Lua classes. Add that to the 3.5ms saving in the processor update itself - some of which will also be garbage collection savings, as we are still running the automatic background GC - and that's a huge saving on the game thread.

There are a lot more areas like this in the code base where we should get similar wins, which will greatly reduce our Lua memory churn. The golden rules to remember are:

• use locals when you use something more than once in a function
• cache data from native code if you use it more than once
• make condensed functions to reduce the number of calls into native code - calling one native function which does all the work is much better than calling several native functions to do the same thing

The next step is to turn off the background garbage collection and only run it at the end of the frame for a fixed number of milliseconds. This will require us to keep an eye on Lua's memory usage and make sure we are cleaning up enough that we can get away with limiting the garbage collection time.

There is more information about Lua performance in the article titled Lua Performance Tips by Roberto Ierusalimschy (http://www.lua.org/gems/sample.pdf).

Once again, a huge thanks to Matt Ellison and Big Ant Studios for letting us use their Lua memory tracer.

]]>
ben@overbyte.com.au (Ben Driehuis) Programming Wed, 02 Oct 2013 08:55:00 +0930
A Profiling Primer http://www.overbyte.com.au/index.php/overbyte-blog/entry/a-profiling-primer http://www.overbyte.com.au/index.php/overbyte-blog/entry/a-profiling-primer The most interesting thing about this port is the optimisation feat ahead of us. Vessel was already (mostly) running on the PS3 at the start, but it needed to run much, much faster. So, where does one begin? The first thing you need to do is profile it - if you don't know where the program is slow (or why) then you can't optimise it. But in order to talk about profiling and optimisation, I'll need to delve (briefly) into the profiling tools on the PS3. (If you're an experienced PS3 dev, then you're probably familiar with this information - skip ahead, I won't mind.) Here's a screen capture of part of the tool I use for PlayStation 3 profiling (SN Systems' excellent Tuner - comfortably the best profiler I've used):

Like all the best profilers, it has lots of colourful bars (the PS2 Performance Analyser was awesome like that too). Colourful bars are crucial for dazzling and confusing artists who may wander by and happen to glance at your monitor. The top of the screen has a time line in seconds - you can see it ends at 0.073, or 73ms. Yup, that's right, 73ms. (Note that in this blog I will almost always use milliseconds rather than frame per second. Frames per second is a useless metric only used by teenage boys who like to put lights inside their PCs).

The display is broken up into roughly 3 parts that we can see here. Each group of horizontal bars corresponds to a thread (click on the image to see it in higher resolution in another window). The top thread is the Physics thread, the next the Game thread and then the Render thread.

The top, segmented bar is the overall thread usage. When the thread isn't running (like at the start during the stall for sync) you'll see a gap or a bar indicating an OS function (like a mutex or even fileIO). The colourful collection of bars beneath that are the result of marking up important functions in the code base with calls that tell the profiler when the function starts and stops. The longer the bar, the more time it takes. Bars that appear under other bars indicate function calls from that parent. For example, WaterTime calls MoveWaterFromInternalPressure which calls ApplyViscosity and the functions that call DoubleDensityRelaxation and then ProcessSkeletons. It is a very powerful way to view the flow of your code and I often find myself using Tuner just to see how a foreign codebase flows. For those of you out there who don't have access to the PS3 devkits or Tuner, you can use Chrome Tracing to read the output of your profiling in JSON format and display it in your browser (assuming of course, that your browser is Chrome).
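For reference, the JSON Chrome Tracing reads is the Trace Event Format: an array of begin/end ("B"/"E") events with microsecond timestamps. A minimal emitter for hand-instrumented markers might look like this (`TraceLogger` is a hypothetical name, not part of Tuner or any tool mentioned here):

```cpp
#include <cstdio>
#include <string>

struct TraceLogger {
    std::string out = "[";

    // "B" = begin of a scope, "E" = end; events on the same pid/tid nest,
    // which reproduces the parent/child bars described above.
    void begin(const char* name, long long ts_us) {
        char buf[160];
        std::snprintf(buf, sizeof buf,
            "{\"name\":\"%s\",\"ph\":\"B\",\"ts\":%lld,\"pid\":1,\"tid\":1},",
            name, ts_us);
        out += buf;
    }
    void end(long long ts_us) {
        char buf[96];
        std::snprintf(buf, sizeof buf,
            "{\"ph\":\"E\",\"ts\":%lld,\"pid\":1,\"tid\":1},", ts_us);
        out += buf;
    }
    // Close the JSON array; the result can be loaded via chrome://tracing.
    std::string finish() {
        std::string s = out;
        if (!s.empty() && s.back() == ',') s.pop_back();
        return s + "]";
    }
};
```

Wrapping `begin`/`end` calls around the same functions you would mark up for Tuner gives you the same colourful nested bars in the browser.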

So, basically, the longer the bar the longer the function it represents takes to run. Now, how do we choose which bar to optimise?

Generally, when looking for what to optimise first, I try to find something easy, something obvious. Occasionally you'll get lucky and make a simple fix that saves milliseconds - check your compilation options (you wouldn't believe how often people miss one library/project), check that your code is inlining properly (check the generated assembly), look for lots of string based functions as well as intra-frame allocations - basically make sure that everything you're seeing makes sense and that there are no piles of stupid lying around. There are two ideal candidates for optimisation: one function that does a lot of processing, or one or more small functions that are called a lot.

Another feature of Tuner (and most other good profilers) is Program Counter sampling. Program Counter (PC) sampling is done by the profiler reading the program counter register to get an indication of which line of code is being executed at that moment. The more times a particular line is sampled, the more often it is being executed (on PC, the free Very Sleepy does a good job of this). You'll either get a hierarchical view of the functions being called, or a flat view of what is being called when - like in the following image:

The red and blue marks correspond to samples that are performed during the frame. On the left you can see the names of the functions that the samples have occurred in. In Tuner you can also drill down into the actual functions and see which lines and even which assembly instructions have samples attributed to them. This allows you to find hotspots - sections of the code that are called a lot and contribute heavily to the cost of each frame.

Tuner has a lot of additional functionality that I won't go into here - this should be enough to help you understand the captures that Ben and I will post over the upcoming weeks.

In my next post I'll be looking in detail at the optimisation of one of the sections of the Physics thread. I'll just say this; my initial hope was that I could gain a lot of performance back by simply rearranging memory and minimising memory allocation in the physics code. That hope was very quickly and resoundingly dashed - this wasn't going to be an easy task.

]]>
tony@overbyte.com.au (Tony Albrecht) Programming Fri, 27 Sep 2013 09:15:00 +0930
Any Port in a Storm http://www.overbyte.com.au/index.php/overbyte-blog/entry/any-port-in-a-storm http://www.overbyte.com.au/index.php/overbyte-blog/entry/any-port-in-a-storm The Overbyte team has been busy beavering away on a number of projects over the last few months, but there is one that I thought might be of interest to a few programmers out there. We recently started work on a PlayStation 3 port of the successful Indie PC title, Vessel, and what we plan to do over the coming weeks is to document the porting process. We'll look at what we started with, what our goals were, what assumptions we made, what mistakes we made along the way and have a look at some really gritty optimisation problems.

]]>
tony@overbyte.com.au (Tony Albrecht) Programming Sat, 14 Sep 2013 22:21:10 +0930
Optimisation Lesson 3: The Memory Bottleneck http://www.overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-3-the-memory-bottleneck http://www.overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-3-the-memory-bottleneck

Welcome to Lesson 3 of the Overbyte Optimisation series.

If you cast your mind back to the end of lesson 2, we were left with the following code and corresponding samples:

(Note that the samples here are not exactly the same as the ones in the previous lesson, even though the code is. This is due to a change in the version of the compiler being used – we now have faster, more interesting code)

From looking at the source, there is nothing obviously expensive in the calculations at the highlighted points – no branches, square roots, divides or other non-pipelined instructions. So, there must be something other than instruction latency causing the performance problem.

Let's have a look at the disassembled code and see if that sheds any light:

I’ve highlighted the two biggest samples – almost an order of magnitude larger than the other samples (note that the order of the samples is swapped – this is a good example of the compiler reordering instructions into a more optimal format). From the last lesson, we know that fadds and fmadds are only 10 cycles long, so that shouldn’t cause the large number of hits. Let’s look at the C code in relation to the generated assembly:

In the optimised code the y position is loaded first, followed by the x velocity and y velocity; then the x velocity is added to the already loaded x component of the acceleration value. It's this addition that seems to be taking most of the time (we'll look at the reason for the second stall later in this lesson). What is actually happening is that the fadds instruction is stalling, waiting for the x velocity to be loaded into the f10 register - the addition can't take place until it has the correct data in all the appropriate registers. The large number of hits on this instruction implies that this memory load is taking a very long time indeed.

Memory access

In order to load data into a register on the CPU it must be transferred from the memory subsystem. Back in the good old days, memory access times were comparable to instruction times - each in the order of a few cycles. Over the following decades, improvements in CPU speeds followed Moore's law, increasing performance exponentially, while memory access speeds crawled along, improving by only around 10% per year.

Comparison of relative speeds of processors and memory.

In order to reduce this memory bottleneck, CPU designers added a small amount of very fast, very expensive memory called a cache. Caches come in different flavours - Level 1 cache is the smallest and fastest and is generally found on chip. It can only pull data from Level 2, which is larger and slower, while the Level 3 cache (if your CPU has one) is larger and slower again. The highest order cache in your system is the one that retrieves data from main memory. Each element in a cache contains not only a copy of the data at some location in main memory, but also information on where that data came from. Each cache can only access data from the level below (or main memory, if it is the highest order one). A great explanation of how the different types of caches work can be found here.

In the case we are looking at, the instructions request some data to be loaded into a floating point register. The CPU will first query the L1 cache for that data and, if it is already resident there, will return it within a few cycles. If it is not there then the L2 cache is queried, and if the data is present it will be returned as soon as possible - in this case (on the PS3) around 30 cycles. If the required data isn't in the Level 2 cache either, then main memory must be accessed, which takes hundreds of cycles (somewhere between 400 and 600 on PS3 and X360). This is an eternity as far as the CPU is concerned - hundreds of instructions can be executed in the time it takes to pull data from main memory.

If you had a single instruction that could take 100s of cycles, how often would you use it?

The interactive demo below illustrates the relative times for accessing data from L1, L2 and main memory - click on the different memory subsystems to see just how slow main memory is. (In this demo, the latencies are: L1$ -> CPU: 4 cycles, L2$ -> L1$: 30 cycles, Main Mem -> L2$: 600 cycles.)

Interactive demo of relative cache/memory access times

As the demo makes painfully clear, retrieving data from main memory is very slow. An L2 cache miss is in the order of 50 times slower than the average instruction latency.

Minimising the memory bottleneck

So, what can we do about it? We need to use memory, so how do we minimise the impact of these long reads from main memory?

The simplest way is to just use less memory - minimise the size of your regularly used (hot) structures. Do you really need an array of rarely used data (e.g. char name[32];) in the struct, or can you just keep a pointer to it? If it is a large structure, can you break it up into hot and cold parts, or at least into parts that are used in sequence?
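The hot/cold split looks something like this (a hypothetical particle struct, not the article's actual code):

```cpp
// Before: 32 bytes of rarely read name data ride along with every particle,
// so fewer particles fit in each 128-byte cache line.
struct ParticleFat {
    float px, py, vx, vy;
    char  name[32];          // cold data dragged through the cache every frame
};

// After: the hot fields are packed tightly and the cold data lives elsewhere,
// reached through a pointer only on the rare occasions it is actually needed.
struct ParticleCold {
    char name[32];
};
struct ParticleHot {
    float px, py, vx, vy;
    ParticleCold* cold;
};
```

On a 64-bit build the hot struct shrinks from 48 bytes to 24, so a 128-byte cache line now holds five hot particles instead of two and a bit - every line fetched from main memory carries more useful data.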

The next optimisation is to use a special instruction that starts fetching data for us, ideally well before we need it, so that when we do need it, it will (hopefully) already be in the L2 cache. This is called prefetching and, when used effectively, can have a dramatic effect on your code's performance.

Prefetching

On the PlayStation 3, the prefetch instruction is __dcbt(loc), where loc is the location of the data to be prefetched. It triggers a load of a single cache line (128 bytes) into the L2 cache. Let's add one into our loop and see what happens:

Well, our stalls seem to have vanished - but what about the performance? The original code ran in around 10.8ms; this code executes in 4.5ms! Nice! We've sped up our code by more than a factor of two just by adding an instruction.

I chose (p+10) as the location to prefetch because the inner loop is about 40 cycles, and fetching 10 particles ahead corresponds to 10*40 = 400 cycles - approximately the time taken to fetch data from main memory. The graph below shows the times taken for different prefetch distances:

The first column corresponds to the code without the prefetch instruction, while the second column prefetches the particle that we are about to use in that iteration – note this indicates the cost of the prefetch instruction is non-zero, but very soon the effect of prefetching mitigates that cost. At 8 to 10 particles look ahead (192 to 240 bytes) we start to plateau. Now it is worth noting that the __dcbt() instruction loads in 128 bytes at once, so when we’re calling __dcbt() within our loop, we’re actually prefetching the same cacheline 5 times. This is still cheaper than testing if the next particle to be prefetched is in the same cacheline or not (if we unrolled the loop to do 5 particles (5 * sizeof(MyParticle) = 120)  at once we’d get better prefetch distribution (as well as pipeline utilisation), but we’ll look at loop unrolling in a later lesson).
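__dcbt is PS3-specific, but the same loop shape can be written with GCC/Clang's __builtin_prefetch; note that MyParticle here is a guess at the article's 24-byte layout, and the field update is an assumed integration step, not Vessel's actual code:

```cpp
struct MyParticle { float px, py, vx, vy, ax, ay; };  // 24 bytes, as in the article

// Integrate with a ~10-particle (240 byte) prefetch look-ahead, matching the
// plateau in the graph above. __builtin_prefetch is the GCC/Clang analogue of
// the PS3's __dcbt: it pulls the cache line containing the address into cache.
void integrate(MyParticle* p, int count, float dt) {
    for (int i = 0; i < count; ++i) {
        if (i + 10 < count)
            __builtin_prefetch(&p[i + 10]);
        p[i].vx += p[i].ax * dt;
        p[i].vy += p[i].ay * dt;
        p[i].px += p[i].vx * dt;
        p[i].py += p[i].vy * dt;
    }
}
```

As the article notes, this issues several prefetches per 128-byte line (128 / 24 ≈ 5), which is still cheaper than branching to test whether the next particle shares a line with the last one prefetched.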

The reason for the dual stalls

The reason for the dual stalls in the original code is due to the size of MyParticle relative to the size of the cache line. When a single particle (MyParticle is 24 bytes in size) is loaded, an entire 128 byte cache line is loaded. This means that 5 particles will be loaded in with that one, as well as part of the next one.  The following diagram shows the distribution of particles over 3 consecutive cache lines (the pattern repeats every 3 cache lines).

The first case where the struct aligns exactly with a cache line results in a hit on the first stall as the x velocity is used. The next 4 particles are all within that cache line and so do not incur another cache miss penalty. The first miss happens loading the 6th particle and contributes to the first stall once again – py is in cache, vx is the first miss.  The next miss contributes to the second stall as vx is already in the cache and the processor has to wait while vy is fetched from main memory. This behaviour matches our observations with approximately 2/3 of the hits being on the first stall and 1/3 on the second. When we added prefetching both of the stalls were removed (as we would expect).
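The 3-cache-line repeat falls out of simple arithmetic: lcm(24, 128) = 384 bytes, which is 16 particles and exactly 3 cache lines. A couple of helper functions (my naming) locate each cache-line boundary within the particle array:

```cpp
#include <cassert>

const int kParticleSize = 24;   // sizeof(MyParticle)
const int kCacheLine    = 128;  // PPU cache line size in bytes

// For the cache-line boundary at byte b: which particle straddles it, and at
// what byte offset within that particle does the new line begin?
int ParticleAtBoundary(int b) { return b / kParticleSize; }
int OffsetInParticle(int b)   { return b % kParticleSize; }
```

The boundary at byte 128 lands 8 bytes into particle 5 (the 6th particle), so its position is mostly in the old line while its velocities sit in the new one – the first stall. The boundary at byte 256 lands 16 bytes into particle 10, right between vx and vy – the second stall.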

Summary

What we’ve seen here is how to identify and fix a simple system which is performing sub-optimally due to cache misses. The memory used is already contiguous and homogeneous, so prefetch look-aheads are easy to implement. More complex memory access patterns can be considerably more difficult to optimise, so I would suggest that one of your first passes at optimising a system (after algorithmic optimisation) is to simplify your memory access. Determining what data is used in a tight loop and then ensuring that data is stored in a cache friendly fashion will not only give you an instant speed boost, but will make it easier to prefetch and thereby optimise further.

When designing a system that needs to be fast, it is critical that you consider the layout of the memory being used. Optimal code is irrelevant if your data is badly organised.

TLDNR: put all your data in flat arrays.
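To illustrate what "flat arrays" can mean in practice (this is my illustration, not code from the lesson), one step further than an array of structs is a structure-of-arrays layout, where each field is its own contiguous array:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays: each field is contiguous, so a loop that only touches
// y position and y velocity streams exactly the bytes it needs, and the
// access pattern is trivially prefetchable.
struct ParticlesSoA
{
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
};

void IntegrateY(ParticlesSoA& p, float dt)
{
    for (std::size_t i = 0; i < p.py.size(); ++i)
        p.py[i] += p.vy[i] * dt; // two sequential streams, nothing wasted
}
```

With the interleaved MyParticle layout, a loop like this would still drag every particle's x and z components through the cache.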

Coming up next

Now that we’ve removed the memory bottleneck, we’ll next have a look at further optimising this code by addressing the instructions used. In particular, we’ll start using some SIMD instructions and see how that impacts our performance.

Until then, remember, it’s all about the memory.

rtaubert@sunshinedc.com.au (Tony Albrecht) Programming Fri, 23 Mar 2012 15:46:00 +1030
Optimisation Lesson 2: The Pipeline http://www.overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-2-the-pipeline http://www.overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-2-the-pipeline

Before we start on the second in this series of optimisation lessons I’d like to make something clear. The type of optimisation that I am predominantly talking about in these lessons is low-level optimisation, or micro-optimisation.  This type of optimisation should only be applied once you are sure that it will definitely have a benefit, once you have exhausted all obvious higher level macro-optimisations. There have been (too many) times in my coding career when I have spotted some obviously inefficient code and spent too much time joyously optimising it only to end up with rigid, highly optimised, unmaintainable, very clever code that makes absolutely no difference to the overall performance. Or, even worse, it runs slower.

Your first task in optimisation is to figure out what is slow, then why it is slow (and sometimes when), and then remedy that.

Understanding the sample distribution

If you can cast your minds back to Lesson 1, we had profiled our particle system and had started examining the PC samples obtained from our profiler as depicted below.

I’ve highlighted the two lines with the highest hits and we’ll start by focusing on the second, biggest one first.

 blt 0x00011B30 

From the interactive C++ to ASM page from the last lesson we can see that this is a branch command and corresponds to the following C++ code:

 if (p.m_Position.y < 0) 

To understand why this instruction is such a bottleneck we need to understand a little about how the underlying hardware functions – in particular the CPU pipeline. Modern CPUs can issue an instruction or two every cycle. These instructions take a variable number of cycles (depending on the instruction) to produce an output, but (and this is important) multiple instructions can be in flight at the same time – each one at a different stage in the pipeline.

For example, consider the following code which does 3 single precision floating point adds:

 fadds f4,f4,f0
 fadds f13,f13,f30
 fadds f0,f31,f29

If each fadds instruction takes 10 cycles, how many cycles does it take to complete these three instructions?

The answer, as with most things when you get deep enough, is “It depends”. In order to simplify this problem let’s assume that the registers that are being used already have the relevant data in them – this way there is no stalling as the fadds instructions wait for the registers to fill from D cache or main memory (we’ll deal with that can of worms in the next lesson).

The following diagram shows some of the details of the PPU pipeline. It’s from this article and is definitely worth reading if you are at all interested in the Cell BE architecture.

Part of the PPU Pipeline

Each box in this diagram corresponds to a cycle – for example, the first 4 boxes, IC1 through IC4, correspond to the time taken for the processor to fetch instructions from the Instruction Cache – 4 cycles – followed by a two cycle dispatch, and then three cycles to decode and three more to issue the instruction to either the Branch Unit, Fixed Point Unit or the Load Store Unit. All instructions have to go through this 12 cycle front end, but if this pipeline is fully populated then a new instruction can be dispatched every cycle. The time taken for an instruction to enter, pass through and then leave a pipeline is called its latency. I won’t go into too much depth on the details of CPU pipelines here – check your hardware documentation for more information (for consoles this is generally under NDA anyway) and/or have a read of this Ars Technica article.

Back to the case in hand: The FPU pipeline is 10 cycles long which means it will take 10 cycles for a single instruction to pass through that pipeline (the time taken to get the instruction to the FPU can be ignored for this calculation). A new FPU instruction can be issued every cycle after that; this means that f0 from the third fadds will be ready after cycle 12.
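The latency/throughput rule behind that "ready after cycle 12" figure can be written down directly (function names are mine; the front end is ignored, as in the text):

```cpp
#include <cassert>

// Cycles for n independent instructions on a fully pipelined unit with the
// given latency: the first result appears after `latency` cycles, then one
// more result completes every cycle.
int IndependentCycles(int n, int latency) { return latency + (n - 1); }

// If each instruction consumes the previous one's result, issue must wait out
// the full latency every time - the pipeline gives no benefit.
int DependentCycles(int n, int latency) { return n * latency; }
```

For our three independent fadds with a 10 cycle latency, that gives 12 cycles; had each fadds depended on the previous one's output, it would be 30.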

It is worth noting at this point that some instructions (such as divide and square root on the PPU), do not pipeline at all. This means that all the instructions after that command will stall until it has finished, reducing throughput significantly (some non-pipelined instructions take more than 80 cycles).

The more attentive amongst you may have noticed the reuse of the register f0  in the third fadds – this is not an issue as once an instruction is committed into the pipeline the input registers are no longer relevant to that instruction. The compiler will take care of dependencies for you, so you shouldn’t need to worry about it, merely be aware of it.

Branching

“So,” I hear you ask, “what does this have to do with the branch statement in the code we’re looking at?”

Well, I’m getting to that. Branching can be an expensive operation and the PPU tries to minimise its impact by trying to predict which way the code will branch. It does this either statically (the compiler or programmer specifies which code path is most likely taken) or dynamically (the PPU keeps a table of branches and which path the most recent branches have taken and uses that to infer the most likely branch).  Check out Mike Acton’s article on branching for more details.

A mispredicted branch will incur a complete pipeline flush which is quite expensive (I’ve not mentioned the exact value for the PPU as I’m not sure if it is covered by NDA but you should be able to infer it from the earlier pipeline diagram) and even an accurately predicted branch has a slight cost.

If we once again look at the performance graph from the first lesson we are now in a position to understand what is happening to produce this peak on the right:

Graph of execution time taken over frame number (from Lesson 1).

The flat part of the graph on the left corresponds to the expansion of the particle ball while all particles are above the collision plane. When the particles start to collide with that plane, the time taken increases dramatically, rising from around 15ms per update to nearly 60ms (when nearly all the particles are at rest on the collision plane). This increase in processing time is due not just to the extra processing when a collision occurs (three fp multiplies and a fp negate), but also to the branch and (potential) associated pipeline flush.

Removing Branches

Fortunately, there is an instruction that can help us here; fsel (floating point select). This instruction provides a way of effectively performing our own speculative execution and then choosing one of two operands based on a third (see below):

 d = __fsel(a,b,c); // is the same as the following code but with no branch instruction
 if (a >= 0)
     d = b;
 else
     d = c;
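A portable stand-in for the PPU intrinsic looks like this (the assumption being that your compiler turns the ternary into a conditional move or select instruction rather than a branch – most do at -O2 on select-capable targets, but it isn't guaranteed):

```cpp
#include <cassert>

// Portable sketch of the PPU's __fsel: select b when a >= 0, else c.
// Note that a == 0 (and -0.0f) selects b, matching the hardware semantics.
inline float fsel(float a, float b, float c)
{
    return (a >= 0.0f) ? b : c;
}
```

If you need a guarantee, inspect the generated assembly – which is, conveniently, the whole theme of this lesson.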

The other benefit of this instruction is that it is pipeline friendly – it also takes 10 cycles, our pipeline is left intact once it has completed, and we can have multiple of these instructions in the pipeline at the same time.

So, let’s rewrite our particle code without any classes (to make it easier to see how it maps to asm), pre-emptively calculate the bounced position and velocity and pick the resultant position and velocity via fsel() commands based on the y value of the particle’s position.

void ParticleManager::Update(float fClock)
{
    const MyVector dampVelocity(0.75f, -0.5f, 0.75f);
    const MyVector v3Acceleration = MyVector(0.0f, -9.8f, 0.0f) * fClock; // a*t
    const MyVector at2 = v3Acceleration * fClock * 0.5f;                  // ½a*t^2

    MyParticle *p = &m_Particles[0];
    MyParticle *pEnd = &m_Particles[m_NumParticles];
    for (; p < pEnd; p++)
    {
        float newPosy = p->m_Position.y;
        float newPosx = p->m_Position.x;
        float newPosz = p->m_Position.z;
        float newVelx = p->m_Velocity.x;
        float newVely = p->m_Velocity.y;
        float newVelz = p->m_Velocity.z;

        newPosy = newVely * fClock + newPosy + at2.y;
        newPosx = newVelx * fClock + newPosx + at2.x;
        newPosz = newVelz * fClock + newPosz + at2.z;
        newVelx = newVelx + v3Acceleration.x;
        newVely = newVely + v3Acceleration.y;
        newVelz = newVelz + v3Acceleration.z;

        float bouncedPos_y = -1.0f * newPosy;
        float bouncedVelx = newVelx * dampVelocity.x;
        float bouncedVely = newVely * dampVelocity.y;
        float bouncedVelz = newVelz * dampVelocity.z;

        // choose vel based on y pos
        newVelx = __fsels(newPosy, newVelx, bouncedVelx);
        newVely = __fsels(newPosy, newVely, bouncedVely);
        newVelz = __fsels(newPosy, newVelz, bouncedVelz);
        // set y based on y pos
        newPosy = __fsels(newPosy, newPosy, bouncedPos_y);

        p->m_Position.x = newPosx;
        p->m_Position.z = newPosz;
        p->m_Position.y = newPosy;

        p->m_Velocity.x = newVelx;
        p->m_Velocity.y = newVely;
        p->m_Velocity.z = newVelz;
    }
}

And here is the assembly generated by compiling the above code.

 0x000003D0 bge     0x00000458
 0x000003D4 lfs     f9,0x4(r4)
 0x000003D8 lfs     f10,0xC(r4)
 0x000003DC lfs     f11,0x10(r4)
 0x000003E0 lfs     f12,0x14(r4)
 0x000003E4 fmadds  f9,f1,f11,f9
 0x000003E8 lfs     f13,0x0(r4)
 0x000003EC fadds   f0,f10,f5
 0x000003F0 lfs     f31,0x8(r4)
 0x000003F4 fadds   f11,f11,f4
 0x000003F8 fadds   f30,f12,f5
 0x000003FC fmadds  f10,f1,f10,f13
 0x00000400 addic   r5,r4,0x18
 0x00000404 fmadds  f12,f1,f12,f31
 0x00000408 fmadds  f13,f6,f2,f9
 0x0000040C fnmadds f9,f6,f2,f9
 0x00000410 fmuls   f31,f8,f0
 0x00000414 fmuls   f29,f7,f11
 0x00000418 fmuls   f28,f8,f30
 0x0000041C fadds   f10,f10,f3
 0x00000420 fadds   f12,f12,f3
 0x00000424 fsel    f9,f13,f13,f9
 0x00000428 fsel    f0,f13,f0,f31
 0x0000042C fsel    f11,f13,f11,f29
 0x00000430 fsel    f13,f13,f30,f28
 0x00000434 stfs    f10,0x0(r4)
 0x00000438 stfs    f12,0x8(r4)
 0x0000043C stfs    f9,0x4(r4)
 0x00000440 stfs    f0,0xC(r4)
 0x00000444 stfs    f11,0x10(r4)
 0x00000448 stfs    f13,0x14(r4)
 0x0000044C clrldi  r4,r5,32
 0x00000450 cmplw   r4,r3
 0x00000454 blt     0x000003D4

If you have a HTML5 compliant browser, then you can interactively walk through the ASM and C++ code in another window by clicking on the following image:

As you can see from this assembly (particularly via the interactive walkthrough), the only branches are associated with the for loop which iterates through all of the particles.

We would expect this code to run faster as there are no additional data accesses and there are fewer stalls in the worst case scenario (all particles mispredicting a branch), but how much faster? The following graph shows the performance of the code in lesson 1 against this lesson’s performance.

As you would expect, the performance is constant as there is no branching based on the particle’s position, but what you might not have expected is that the baseline performance is better than the original solution. Even though we are calculating the bounced position and velocity in addition to the non-bounced ones for every particle, this code is still significantly faster than the branching one. Basically, we are doing more work in fewer cycles. The removal of the branch removes the need for a compare (which stalled waiting for the result of the update of the y position) as well as the branch itself, thereby allowing better pipelining of the inner loop.
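It is worth convincing yourself that the branchless rewrite is behaviour-preserving. Here is a sketch, reduced to one dimension for brevity (the struct and function names are mine, and a ternary stands in for __fsels): both forms perform identical arithmetic, so their results are bit-exact.

```cpp
#include <cassert>

struct P { float y, vy; };

// Branchy update, in the style of Lesson 1 (position integrated from the
// old velocity, to match the branchless form below).
P UpdateBranchy(P p, float dt)
{
    const float a = -9.8f * dt;              // a*dt
    float y  = p.y + p.vy * dt + a * dt * 0.5f;
    float vy = p.vy + a;
    if (y < 0) { y = -y; vy *= -0.5f; }      // bounce and damp
    return {y, vy};
}

// Branchless update: compute both outcomes, then select on the sign of y.
P UpdateBranchless(P p, float dt)
{
    const float a = -9.8f * dt;
    float y  = p.y + p.vy * dt + a * dt * 0.5f;
    float vy = p.vy + a;
    float by = -y, bvy = vy * -0.5f;         // speculative bounced results
    return { (y >= 0) ? y  : by,             // stands in for __fsels(y, y, by)
             (y >= 0) ? vy : bvy };
}
```

Running both over many steps and comparing results is a cheap regression test when you make this kind of transformation in production code.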

This is a pretty good performance improvement for a fairly small change, but finding wins like this in production code is rare. This type of optimisation really only benefits very tight inner loops; using branchless logic in higher level code will usually have no measurable performance improvement at the cost of obfuscating your code – so choose wisely.

The important things to take away from this lesson are:

1. Instructions pipeline. Taking advantage of that can give you big performance wins
2. Branching can be bad on certain architectures (primarily because it trashes the pipeline)

Coming up next

So where to next? Let’s have a look at the PC samples of this new code and see where the bottleneck has moved to:

Samples attributed to C++ code

Most of the samples in this code seem to have coalesced around the calculation of the y position of the particles. The reason for this is very interesting and we’ll address this in the next lesson.

So endeth the second lesson. Please leave a comment if you’ve spotted an error, need something clarified or would just like to leave some feedback.

 I’ve just been pointed at an excellent article on pipelines and branchless maths: “Down With fcmp: Conditional Moves For Branchless Math”

rtaubert@sunshinedc.com.au (Tony Albrecht) Programming Thu, 10 Nov 2011 15:43:00 +1030
Optimisation Lesson 1: Profiling http://www.overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-1-profiling http://www.overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-1-profiling

The performance of a game is critical – it defines the scope of a game. It specifies how many, how far, how much detail, the level of freedom, even the length of the project. Whether you are an artist, designer, producer or programmer, the performance of a game has an impact on you. As Jeff Atwood recently stated, “Performance is a feature” and as with all features, they need to be scheduled, researched, implemented and maintained.

This article is the first in a series based on the Masterclasses which I’ve recently presented for Game Connection, and will walk you through several phases of optimisation performed on a simple piece of code. It is primarily aimed at intermediate programmers, or experienced programmers who haven’t had much to do with performance and low level programming. The aim is to give a detailed example of finding performance bottlenecks and then addressing them. It should give you an idea of how code can be optimised and the magnitude of performance improvements you can expect from different changes. The series of lessons will cover profiling, branch removal, cache aware programming, loop unrolling and SIMD programming.

Introducing A Simple Particle System

The first piece of code we’re going to look at is a very simple particle system. We will look at the update phase only, ignoring rendering and GPU performance for this example and considering only the CPU for now. The CPU in this case is the PlayStation®3’s PPU and external profiling is done with SN System’s excellent Tuner.

There are a couple of simple classes: the vector class MyVector and the particle class MyParticle.

class MyVector
{
public:
    MyVector(){};
    MyVector(float X, float Y, float Z) { x=X; y=Y; z=Z; }

    inline const MyVector& operator+=( const MyVector& rhs );
    float x,y,z;
};

inline const MyVector &MyVector::operator+=( const MyVector& rhs )
{
    x += rhs.x;
    y += rhs.y;
    z += rhs.z;
    return *this;
}

inline MyVector operator+( const MyVector& lhs, const MyVector& rhs )
{
    return MyVector( lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z );
}

inline MyVector operator*( const MyVector& lhs, const MyVector& rhs )
{
    return MyVector( lhs.x * rhs.x, lhs.y * rhs.y, lhs.z * rhs.z );
}

inline MyVector operator*( const MyVector& lhs, float rhs )
{
    return MyVector( lhs.x * rhs, lhs.y * rhs, lhs.z * rhs );
}

and the particle class;

struct MyParticle
{
    MyVector m_Position;
    MyVector m_Velocity;
};

The particles in this system will accelerate under gravity and will bounce on the y equals zero plane with some damping applied to approximate energy lost in collision. If we can remember back to our high school maths, the equations of motion are;

${v}' = v + at$

${p}' = p + vt + \frac{1}{2} at^{2}$

So, the update loop for the code is:

void ParticleManager::Update(float fClock)
{
    int iParticle;
    const MyVector dampVelocity(0.75f, -0.5f, 0.75f);
    const MyVector v3Acceleration = MyVector(0, -9.8, 0) * fClock; // a*dt

    for (iParticle = 0; iParticle < m_NumParticles; iParticle++)
    {
        MyParticle &p = m_Particles[iParticle];
        // v' = v + a*dt
        p.m_Velocity += v3Acceleration;

        // Update the particle position
        // p' = p + v * t + 0.5 * a * dt^2
        p.m_Position += p.m_Velocity*fClock + v3Acceleration*fClock*0.5f;

        // if Particle is below the ground plane, then damp and reverse the velocity
        if (p.m_Position.y < 0)
        {
            p.m_Position.y *= -1; // reflect position along y
            p.m_Velocity = p.m_Velocity*dampVelocity;
        }
    }
}

This code gives us the following output (for 256k particles)

A simple particle system in action.

Step 1: Profile

The first thing we need to do is figure out how fast this code is running, and for that we need a useful metric. Frames per second is borderline useless and a percentage of a frame is just as bad. Use a sensible metric like milliseconds or microseconds – then you can accurately measure the length of a function independent of the frame rate. Also, if possible, capture profiling information over a number of frames. This will allow you to find spikes and patterns in the execution of your code. When optimising code, the spikes are generally what need to be addressed first (and are often the easiest to fix too).
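To see why frames per second is such a misleading metric, convert it to what we actually pay for – milliseconds of frame time:

```cpp
#include <cassert>

// Milliseconds per frame for a given frame rate. Frame rate is the
// reciprocal of frame time, so equal-looking fps drops represent very
// different amounts of added work.
double MsPerFrame(double fps) { return 1000.0 / fps; }
```

For example, dropping from 60fps to 50fps costs about 3.3ms of extra work per frame, while the smaller-looking drop from 30fps to 25fps costs about 6.7ms – twice as much. Milliseconds stay honest; fps does not.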

Consider the following graph of this particle system. The vertical axis is milliseconds taken to process all of the particles and the horizontal axis is time (# frames)

Graph of execution time taken over frame number.

Note the shape of the graph – it starts at around 17ms per frame and then dramatically sweeps up to sitting at about 60ms per frame. There is obviously something interesting happening here that requires further examination. Let’s run a profiler over this code and see if we can get a better idea of what is happening with this example.

There are two main types of profiling that are generally used; instrumented and sampling.

Instrumented Profiling

Instrumented profiling relies on either manual or automatic insertion of code that stores a time stamp at set points for later analysis. Manual instrumentation relies on the programmer inserting code which wraps the code to be timed, for example, at the start and ends of a function.

void FunctionThatIsTakingALongTime()
{
    PROFILE_START("Function Name");
    for (int i = 0; i < BigNumber; i++)
    {
        Do_Something_Expensive(slowly);
    }
    PROFILE_STOP();
}

Manual instrumentation allows a programmer to store and use the timing information in any way that they see fit - on-screen profile bars, graphs or possibly profile data output to file for analysis in a separate application. It is the most versatile method of profiling, but care must be taken that the actual implementation of the profiling stubs aren’t too intrusive, slowing down the execution of the code being profiled. The graph above was produced using manual instrumentation.
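One common way to implement stubs like PROFILE_START/PROFILE_STOP (my sketch, not the actual macros behind the graph above) is an RAII scope timer – the destructor records the elapsed time when the scope exits, so a single object covers both ends. A real implementation would log into a preallocated buffer rather than printing, to keep the instrumentation itself cheap:

```cpp
#include <chrono>
#include <cstdio>

// Minimal RAII scope timer in the spirit of PROFILE_START/PROFILE_STOP.
struct ScopedTimer
{
    const char* name;
    long long* out; // optional: where to store the elapsed microseconds
    std::chrono::steady_clock::time_point start;

    explicit ScopedTimer(const char* n, long long* o = nullptr)
        : name(n), out(o), start(std::chrono::steady_clock::now()) {}

    ~ScopedTimer()
    {
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start).count();
        if (out) *out = us;
        std::printf("%s: %lld us\n", name, us);
    }
};
```

Usage is just `{ ScopedTimer t("Function Name"); /* work */ }` – the braces define the measured region.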

Automatic instrumentation is performed by most modern profilers, and will allow runtime insertion of start/stop profiling stubs into code. This method is very useful, allowing a programmer to change which functions to instrument at run time, but relies on the profiling data being collected and displayed by an external application.

Another form of automatic instrumentation is call profiling which relies on the compiler inserting pre- and post-amble profiling code to every function. This method of profiling gives 100% coverage but is very intrusive (it slows down the running of the profiled code significantly).

With both types of instrumented profiling, the more instrumentation that occurs, the more intrusive it is and the slower the code will run. Instrumenting small, high frequency functions can artificially bloat the time taken in those functions and their parents, so care must be taken not to instrument too much.

Example of high frequency instrumentation slowing performance.

In the above diagram, each coloured bar corresponds to a function call, with the top ones calling the ones directly below them and so on, down to the fine grain leaf functions at the bottom. While this visual representation of the code path is useful for understanding the flow of code, it does slow down parts of the code non-uniformly and isn’t necessarily an accurate depiction of the relative performance of the profiled functions.

Program Counter Sampling

This form of profiling requires an external profiler which samples the executing code’s program counter (PC) at set intervals over many frames, collecting these values for later analysis. The Program Counter is a CPU register which stores the current address of execution. Each PC value and the time at which it was taken can be reconstructed to determine which functions were being executed at which time. From this data we can see the overall execution footprint of the executing code – slow code will have more samples attributed to it while fast code will have fewer.
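The idea reduces to a histogram, which can be sketched with a toy model (entirely my construction): pretend we know which function the CPU was in at every cycle, then sample that "trace" at a fixed interval, exactly as a sampling profiler samples the PC.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model of PC sampling: `trace[i]` names the function the CPU was in at
// cycle i; sampling every `interval` cycles builds the profile histogram.
// Code that runs longer simply soaks up more samples.
std::map<std::string, int> SampleProfile(const std::vector<std::string>& trace,
                                         std::size_t interval)
{
    std::map<std::string, int> hist;
    for (std::size_t pc = 0; pc < trace.size(); pc += interval)
        ++hist[trace[pc]];
    return hist;
}
```

A function occupying 90% of the cycles ends up with roughly 90% of the samples – statistically accurate at the function level, which is exactly the granularity claim made later in this lesson.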

PC Sampling Example in SN Tuner

In the above example, we can see that memory allocation is taking a long time and should be investigated further. The annotation has been added manually.

PC sampling, as with instrumentation, can be intrusive and affect the execution speed of the code. The higher the frequency of sampling, the more intrusive it is (although, in this case, it has a uniform impact across the entire frame), so it is advisable to minimise the sampling frequency and collect over many frames instead to get a more indicative performance measurement, unless you explicitly need high frequency information for some reason.

Profiling the Particle System

So, back to the problem at hand; we profiled the particle system using PC sampling and the profiler  attributed the samples to the following lines of code:

ParticleManager::Update() samples

The first column is the total number of samples at that PC, while the second column is the number of samples for that frame. Basically, the larger the number attributed to a line, the slower that line is to execute. The problem is that the PC samples correspond to particular assembly instructions, and the application is responsible for determining which lines of C++ source those instructions are part of. This is a non-trivial problem (see the interactive example at the end of this lesson), as optimised code will generally move instructions around into an order which provides the best performance, so what we are left with is an approximation of where the slowest lines of code are. At the function granularity these samples are accurate, just not at the level of individual C++ statements.

So, in order to accurately see what is taking the most time within a function, we must look at the disassembled instructions for this code fragment. I know what you’re thinking. You’re thinking

“Assembly!? I can’t read that. Only really old guys and freaks can read that!”

But don’t worry, it isn’t necessary to be fluent in the asm of your platform of choice, but it is important that you are able to understand the gist of the disassembled code. It’s like being in a foreign country – you don’t need to be conversational (although that is ideal) but it will help immensely if you know some of the basics; the “please”, “thank you” and “Excuse me, where’s the bathroom?” – the phrases you use all the time.

For this lesson, I’ll not attempt to teach any assembly. I’ll leave that as an exercise for the reader – it should suffice for this PPU based example to know that instructions starting with an ‘l’ are load instructions, and those starting with an ‘s’ are stores. Instructions starting with an ‘f’ are floating point operations – the full mnemonic is generally enough to guess the rest. If you want excruciating detail, the PPC books can be found here.

The next image shows the PC samples attributed to the individual assembly instructions. This tells us accurately which instructions are taking the longest to execute.

The numbers on the left are the samples attributed to that line of assembly, and this tells us exactly what code we should be looking at. So, how do these instructions relate to the C++ code? Well, I’ve coded up a simple interactive page which will allow you to step through the C++ of the inner loop and its generated assembly instructions interactively (as long as you have a HTML5 compliant browser, that is – otherwise you’ll just get a blank page, in which case you need to go here). The C++ code being executed is highlighted on the right and the corresponding ASM on the left is shown so you can see what the code compiles into. Take note of the fact that there isn’t a direct linear relationship between the C++ and the ASM – the generated ASM can be quite out of order (even in this simple example). The blue box in the bottom right gives a description of what is happening for that line.