Overbyte - Latest blog entries http://overbyte.com.au/index.php/overbyte-blog/latest Mon, 27 Mar 2017 14:49:34 +1030 en-gb Vessel Post Mortem: Part 3 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-3 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-3 This is the final part of the Vessel Porting Trilogy. The previous parts can be found here 

After the Christmas break (where I was forced into yet another family vacation) I dove straight back into the code, hoping to get the final issue fixed and the game submitted as soon as possible. There was still time to submit and if I passed Sony’s submission process first time then we could still have the game on the PSN store by the end of January. I fixed the remaining problem with the start up error reporting and added in a little more error checking just to be sure. I was pretty paranoid about the submission process as I didn’t want to delay the game anymore by failing - I carefully checked everything twice, crossed my fingers and hit the ‘submit’ button to Sony America (SCEA) on January 9.



In parallel to the code submission process there is a ‘Metadata’ submission process. This corresponds to the PlayStation Store assets and the game’s presence there. It consists of all the text, images, trailers, prices etc and they all have very specific requirements that must be met in order to pass submission. James (our QA guy) managed most of this and involved communicating with Strange Loop’s art guy Milenko (who was incredibly responsive - I’m not sure he ever sleeps) and consisted of asking for different resolution images and screenshots, as well as sourcing the different language versions of the game text. It took us a few submissions of the meta data to get it all right, but the turnaround was pretty quick and a continuous dialog with the Sony helped a lot.

The code submission process consists of uploading the game in a special package format plus some extra documents that describe the trophies and other bits and pieces. We had to submit to both SCEA and Sony Europe (SCEE) so we could have the game released in both those regions. We hadn’t submitted to SCEE at the same time as SCEA as we were still waiting on some publisher registration details to come through, so all I could do was wait for that and for the response from SCEA on our initial submission while I busied myself with some other work.

On January 18th I received the first fail from Sony. As fails go, it was a pretty good one. There were three “Must Fix” bugs: one was due to my entering the wrong data in a field when submitting the game package, and two were due to saving games failing when there was no disk space left. They’d also mentioned some slowdown in some of the levels - I’d expected that, and as it wasn’t a critical bug, I ignored it. The save game bugs proved troublesome - Ben had written all of the save game code and with him gone I now had to learn how Sony wanted you to do it, how Ben had done it and how I could make it work correctly. It took me a few days to find and fix the problems and by this time the details that SCEE required had come through so I resubmitted to SCEE and SCEA at the same time (January 24)

I was quietly confident that it’d all pass this time. I’d thoroughly tested the save game code now and it all worked. What could go wrong? I seriously considered heading out and buying a bottle of Chimay Blue as a pre-emptive celebratory reward.


Image from https://bonbeer.wordpress.com/category/places/mechelen/

The first fail I saw came back from SCEE on February 2nd. I’d messed up the submission data again (I hate filling out forms) plus there was a legitimate bug they’d found where once you’d finished the game you couldn’t continue on from that save game if you wanted to continue playing to complete all the trophies. When I had fixed that I re-submitted the new build to SCEE on February 4th and then began to wonder what had happened to the SCEA build - it should have come back at about the same time as the SCEE one. I also worried that they would pass it with the “continue when finished” bug in it. I needn’t have worried, as the SCEA team came up with a whole slew of new bugs that weren’t mentioned in any of the previous tests. 

This new bug report that SCEA had sent back was very concerning. The crashes they were reporting sounded like bugs I’d previously fixed - I was also impressed with their thoroughness. One of the bugs was something like “Hang from the rope in level blah and throw seeds for at least 5 minutes into an area with no water. Crashes 70% of the time”.  While looking at this bug I received notice from SCEE that they had passed Vessel. Which was great, but it was also worrying as I now had numerous repeatable crashes in the build - so I failed the SCEE build myself in order to be able to submit a fixed build which would match the SCEA build.

By now I was getting a little frazzled. It was like one of those movies where the protagonist has slain the evil uber zombie, only to have it rise from the dead over and over again. Would this submission process never end?



I eventually tracked the crash bugs down to a simple stupid error. There was a section of code that was responsible for limiting the amount of fluid, fluros and seeds in a given section. When the frame rate dropped below 60fps, this code would kick in and drop the number of drops/fluros or seeds to the minimum value deemed suitable for that level. During the final stages of development we had created a FINAL_RELEASE mode which turned off all of the unnecessary debugging code. Unfortunately, this erroneously included some code which updated the time values that the limiting code used, so the fluid, fluros and seeds were never being reduced. Ever. This meant that the game would have been running much slower than it should have, and was prone to crashes whenever certain hard coded limits were exceeded. I've never been so relieved to fail anything.

Something I’d like to focus on here is the level of performance that Vessel was submitted with; Ben and I spent a huge amount of our time just trying to squeeze as much performance out of this game as possible on the platform we had. For the most part, I'm pretty happy with the results. Yes, you can abuse the game and make it slow down quite easily - if you try to fill a room with fluros and water and seeds then you’ll most likely see some slowdown - but during normal play, the game maintains 60fps simulation at 30fps frame rate (the graphics is 30fps, so there are two simulation passes per visible frame). However there are a few levels where the sheer number of fluoros and fluid required to solve the puzzles means that the frame rate will drop to 20fps. Given another month I reckon I could have fixed that too, but given how far over time and budget we were, that wasn't an option. So, while I’m happy with how much we improved the frame rate of the game, I'm still a little disappointed that we didn't hit a full 60fps simulation everywhere.

Sure, the PlayStation 3 hardware did slow down the porting of the game initially, but given the complexity of the fluid simulation required I doubt that there will ever be an Xbox 360 version released. The SPUs allowed us to perform at least 80% of the physics simulation completely in parallel - with no impact on the main processor at all. There is no way you’d get the same level of performance on the X360.


With the FINAL_RELEASE bug fixed and the game resubmitted to SCEE and SCEA (which passed on the 20th and 22nd respectively) I was finally free! The metadata was all sorted and we’re now expecting the release on March 11 in the SCEA territories and March 12 in the SCEE regions. We’re all hoping it does well enough to cover our costs (as all devs with newborn progeny are).

What went right

Having a QA guy in house was invaluable. An impartial, non-technical third party to throw our fixes at kept us honest. As a developer you get very close to the game and like to assume that your fixes have made a positive difference and you’re moving forwards. That’s not always the case.

Honest communication with the Publisher/client. I tried to keep John at Strange Loop Games constantly up to date with the progress (or lack thereof) of the game. He was incredibly understanding with the continuous delays, and while it was hard to deliver bad news, at least it reduced the surprises.

Having good, experienced coders on the job. Bringing Ben onboard was a good choice, even though we had to let him go in the end. There is no way I could have done this without him. Working from the same physical office was also beneficial. I worked remotely for over 5 years leading up to this port, and I think we’d have delivered even later if we’d worked separately.

What went wrong

Work estimation. I fundamentally underestimated the amount of work required. I should have applied Yossarian's Circular Estimation Conjecture and multiplied it by PI - that would have been closer to the mark. The big killer was the misunderstanding with which threads needed to run at 60fps. If it was just the physics I'd underestimated, we probably would have been a shade over a month late (maybe two), but with both the physics and game threads needing to execute in under 16.6ms, we really had our work cut out for us. The amount of time taken for submission shouldn't be forgotten either; 4 to 8 weeks after the initial submission should see your game on the store.

I’d also recommend to anyone doing a console game that they get as much TRC compliance done as soon as possible in the development of the game. Get the load/save stuff working, trophies, error reporting, quitting, controller connection, file error reporting and the like in place and functioning sooner rather than later.

In Closing

The porting of this game was a trial for me. There was a week or two around September/October where I was really doubting that we could deliver a game that ran at a playable frame rate. I'm proud of what we delivered in the end, and if you’d taken away the financial and time pressures I would have thoroughly enjoyed the challenge. I've learnt a lot from this experience, and I'm looking forwards to the next one.

Just not right now, OK? Maybe a bit later in the year.


tony@overbyte.com.au (Tony Albrecht) News Tue, 11 Mar 2014 09:00:05 +1030
Vessel Post Mortem: Part 2 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-2 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-2 In the last episode, our intrepid heroes had discovered that rather than having almost finished the PlayStation 3 port of Vessel, they were barely half way there. Read on to find out what happens next…

With the realisation that both the game and render threads needed to run a 60fps, we had to reassess where we focussed our optimisation efforts. The game thread was doing a lot of different jobs - Lua scripting, playing audio, AI, animating sprites - all sorts of things. There was a lot to understand there, and on top of that, both the physics and render threads still needed more work. But performance wasn't the only concern - there was TRC compliance, game play bugs, new bugs introduced by our changes and, as we eventually discovered, the need for a PC build (more on that later).


Now, Vessel  was ‘mostly’ bug free on PC when we started - we did find a few bugs in the Lua and in game code and the few compiler and shader compiler bugs added to that, but there were subtle differences in the way the two platforms dealt with numbers which meant that weird things sometimes happened. For instance, a door on one of the levels would get stuck and your character would have to bump into it in order for it to open. This was due to a variation in position values about 5 decimal places in causing the jamming on PS3. Additionally, there was some code that was used in a number of places that did something like the following;

x = MIN( y/z, 1.0);

What this did on PC was catch the case where z tended to zero and clamped x to 1.0 ( MIN(infinity, 1.0) = 1.0). Unfortunately, on PS3 value/0.0 was QNAN and MIN(QNAN, 1.0) was QNAN. So the clamping never happened resulting in much weirdness in hard to reproduce places (it only occurred when z was zero), and was therefore a bugger to find.

This meant that we weren't just optimising and adding in new PS3 functionality, we were also debugging a large foreign codebase. I hadn't expected that we'd need to change Vessel much - it was a shipped game after all and so I had assumed that it was pretty much bug free.

I had also assumed that for the most part we wouldn't need to change any assets significantly. Things like platform specific images for controllers and buttons were just texture changes and so weren't a problem. Text was fine too. Unfortunately, problems like the sticking door mentioned above meant that we need to go into some of the levels and jiggle bits around. Later in the life of the port we had to tweak fluid flows, emission rates, drains and other fluid based effects to help with performance. 

All of this meant that we needed to have the editor up and running, and the editor was built using the same engine as the game so we needed to maintain two builds of the game - one for PS3 and one for PC solely for the editor. This was done crudely by #defineing most of our PS3 changes out and leaving the PC code in there inside the #else conditional. This did slow us down a bit and also made our code that much uglier, but it had the side effect of allowing us to easily test if a bug was present in the PC build or was specific to the PS3.

By the end of September we had fixed many audio issues and managed to get LuaJIT (an optimised version of Lua) working again (it was mostly working but was causing some repeatable problems in some places). Ben also profiled the Lua memory usage (article here) and tweaked its garbage collection to be more CPU friendly (it was occasionally peaking at 13ms in a call). So, slowly, we were scraping back time spent on the game thread. It was still a serious problem, but it was getting better.


Back on the physics system, more and more the CPU code was being ported to the SPUs - in many ways, this was getting easier as I was building a lot of infrastructure which made it more convenient to port. Many of the patterns that I used for one system would translate well to others. I started reducing the amount of memory used by parts of the physics, like dropping the size of the WaterDrop struct to 128 bytes from 144 bytes meant better cache usage and simpler processing on SPU (nicely aligned structs meant that I could assume alignment for SIMD processing).

Some optimisations were straightforward.  For example, I changed the way that we processed drops on fluros skeletons. Instead of iterating over the entire drop list and checking each drop to see if it was part of skeleton ‘N’ I changed it to pass over the drop list once, partitioning the drops into buckets of drops per skeleton. Then we could just process a given skeleton, touching only those drops on that skeleton. Obvious in hindsight, and unnecessary on the PC build, but it takes time to get to the point where you really start to understand the code and can begin to change the way it works (rather than optimise a function to perform the same algorithm, just faster).

At this time we also saw the transition of the render thread to the status of low hanging fruit. So we started working on that - unrolling loops, prefetching, minimising data copying. Tweaking, poking, prodding and breaking, testing, measuring, then swearing or cheering and checking in.

The end of September rolled around and we still had no real end in sight. I kept talking to Strange Loop, trying to keep them up to date while we furiously worked to finish this game. As we’d already blown the initial quote, our contract specified  that we would have reduce our daily rate for the remainder of the project, to be recouped on sales. So not only did we have significant time pressure, we now had financial pressure on top of that.


October was very productive - it saw over 150 bug fixes and optimisations checked in for the month. As the game was improving in performance and becoming more playable James, our QA guy, was able to play more and more of it. We hadn't actually managed to do a complete play through of the game by this stage, so we focussed heavily on removing all blocking bugs and getting the game in a functionally shippable state (assuming that we would continue to make performance improvements).

Second week in, after discussions with Strange Loop Games we decided to postpone the release of the game to January. This meant submitting the game to Sony in Europe (SCEE) and in America (SCEA) by the end of the year, a mere 4 months after our initial estimate. There was no way I was going to let this date slip again.


Ben was still working on optimising the Lua execution and audio (which at this stage was still taking 4 or 5 ms per frame with peaks way higher than that). I’d thought that as Vessel used FMOD that it would all just work when I turned it on. Unfortunately, the audio was a complete mess on the PS3. While profiling the game thread in detail we discovered that it was taking up huge amounts of time, spiking and causing frame stutters as well as just plain behaving badly. It took weeks of intermittent work plus lots of discussions with the FMOD boys to figure out what the problems were. Turns out, many of the sounds were just set up incorrectly for the PS3 build - effects and layers that were disabled on PC were left on for PS3. Some sounds would loop infinitely and multiple sounds would play at the same time. All these things took time to discover, time to investigate a solution and time to fix. Something we had very little of.

To make things worse, Vessel performed no culling for the rendering, AI or audio. At all. The entire level was submitted to the renderer which culled it and then passed off the visible bits to the SPU renderer. Also, Lua scripts (which are slow at the best of times) were being called even when the things they were modifying weren't in view. Audio was also being triggered by objects that were a long way from the player, relying on distance attenuation to reduce volume. All of this meant that there was a lot of work going on under the hood for no visible or audible impact.

We were still stumbling across weird little issues. For example, the animated textures in the game were never animating. They were loading correctly, but never changing. Ben tracked this down to an endian swap function which was broken (the PC and PS3 have different byte ordering for their numbers and so the same data when loaded from file must be modified depending on platform). This endian swap was only ever called in one place and only by this particular animated material. 

The fix is often easy. Finding what to fix is hard.


Even though we had a tight timeline, we also had other client obligations which saw us each lose 2 weeks in November. This helped financially (well, it would have if they had paid on time) but put even more time pressure on us. Regardless, we were confident that we’d make the end of December deadline. Plenty of time.

We finally managed to get a complete play though of the game in November. Only to realise that the endgame didn’t work. Fix, patch, hack, test.

It was during the last week of November that Ben had a breakthrough. He bit the bullet and implemented a simple container based culling system for the game thread. All objects in a loaded level were placed within a rectangular container and there were an arbitrary number of these per level. This meant that we could quickly check what should be rendered or processed (Lua or audio) by looking at the container in view and those next to it. Conceptually simple, but implementation required that we modify the editor (which we’d not done before) and renderer, then visit each level in the editor and manually add the new containers then re-export everything.

This fix made a massive difference to performance. Audio was faster as it wasn’t playing as many sounds. There was less Lua processing happening as there were fewer things active. And rendering was faster as there was less being sent to the render thread. So it worked, and it worked well. In fact, we had to limit the frame rate to 30fps as some places were now rendering at 60fps on all threads!



The container fix was instrumental in getting the game thread to a more playable speed, but it also introduced a lot of little bugs that took a month to fix. And still, it was too slow.


With only three weeks until the Christmas break we figured we needed to start cutting back on the content in some of the levels in order to get the frame rate up. We picked a few of the worst and closely examined them to try and figure out how we could speed them up. We found the following; 

  • Trees were very costly (as were their roots). So we cut some of them down.
  • Drains were expensive, so we removed some of them.
  • Areas with lots of water were obviously expensive, so we reduced the amount of water in some areas, cutting back on the flow or adding drains to reduce the amount of water in pools.
  • We tweaked the object limiter code to ensure that there were always enough fluros to finish a given level, yet not too many to make it run too slow. Same with the number of drops and seeds.

All of the above helped, and the game was now running pretty well. There were still areas where it would slow down, and you could (and still can) slow down some areas intentionally by spraying lots of water and generating lots of fluros, but there was no time left for further optimisations. I'd already built 13 different SPU tasks to speed the physics up and one or two for the render thread - it was getting very hard to speed things up more, and at this stage it was risky to make any extra significant changes. This would just have to do.

James was now noticing some play specific bugs. Some fluros weren't moving correctly - jumping too high and destroying themselves on the scenery or just behaving badly. Which is fine in most cases, but this was happening with a key fluro that you needed to use to complete the level. We had to modify the level to make it work again.

In addition to the optimisations that we were still implementing, and the bug fixes for the odd things that were happening, we also had to make sure that the game was TRC compliant. James trawled through that massive missive, highlighting things that we needed to check - the ways that save/load games functioned, how the game shut down, what would happen in the case of no hard drive space left, the order that trophies unlock, so many things. And so little time.


And, on top of that, the financial pressure on the company due to the length of time that Vessel was taking to port, plus the reduced pay rate we were getting for the overage and the fact that there was very little work lined up for the new year meant that I had to notify Ben that we were going to have to let him go.  

It was one of the hardest things I've ever had to do. I felt like I’d betrayed him - he is an awesome coder and he’d become a good friend. And just before Xmas FFS. To make things worse, the day I had to do it he was working from home and I had to let him know over Skype. It was just awful.

He called me back within a couple of hours to tell me he had just managed to land a new job. In two hours. I told you he was good.


From the left: Tony, Ben and James.


So, back to the code. With a week and a bit to go we mainly focussed on the remaining TRC issues. The game was running as good as we were going to get it to in the time we had and I was satisfied with that. TRC violations were disappearing and it looked like we might just make it.

In the last week (on the Wednesday)  I realised that the file format that we’re using needed to be DRM enabled in order to be TRC compliant, so I spent 17 hours straight trying to fix it. Did my first all nighter in a long time and managed to solve the problem at 5am. I pushed through the next day, fixing, tweaking, hacking, testing while full of sugar and caffeine and then dragged myself home to sleep on the Thursday night.

Submission Time

The final Friday arrived. We performed some clean ups and readied the code for submission. We were going to make it! On one last pass over the TRC checklist I realised that we’d missed something. We needed to be able to report an error if any of the required startup files were missing, but the way that the engine was designed, there was no way to display anything until those critical files were loaded. We had a few hours so I quickly hacked together an emergency renderer that I could use in the case of a critical failure. I gave myself 2 hours to do it and it was working within 30 minutes. Awesome! But, upon further testing, the loading screens refused to work. I still have no idea why - the code I added was never executed and yet still affected the loading screens. The game itself played fine, but those bloody loading screens didn't bloody work. I wasn't willing to submit with what I knew would be a TRC failure so, once again, we’d missed our deadline.

We'd have to delay submission until next year. 


Continued here: Vessel Post Mortem: Part 3

tony@overbyte.com.au (Tony Albrecht) News Fri, 07 Mar 2014 08:53:50 +1030
Vessel Post Mortem: Part 1 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-1 http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-post-mortem-part-1 I started looking at Vessel in January 2013 - initially just in my evenings. In my ‘spare’ time. I was looking at it because John and Martin (the founders of Strange Loop Games and the creators of Vessel) had asked me to consider porting it to PS3.  We knew each other  from our time together at Pandemic Studios in Brisbane where I had worked closely with Martin in the Engine team while John worked as an AI programmer. I was keen to do the port but I wanted to be as sure as possible that I knew how much work was involved in it before committing.


I found the idea of porting such a game daunting - it had a custom rendering engine, a bespoke fluid physics system with major parts of the game code written in Lua - but I figured it was time to step up and take on such a challenge. I have an immense amount of respect for John and Martin - they are both incredibly smart coders and I relished the chance of working with them again. I wanted to do it but I didn't want to let them down. They were going to be my clients, but they were also friends.

I started by looking at their code - in order to give a quote on the porting process I needed to have the game running so that I could profile it. Vessel had already been partially ported to the PS3 by a third party, but wasn't compiling when I first got my hands on it. So I spent a few weeks of long evenings fixing the FileIO, the template compilation errors (GCC and MSVC disagree on some of the finer points of templates), and just commenting out anything I didn't think was relevant in order to get the game up and running. Once it was working, I ran my trusty profiler over the game and tried to guess how much time it would take to get this baby to a shippable state. What I found was a little scary. 


Visualisation of the code changes made during the porting of Vessel to PS3.

Vessel runs predominantly on three threads. The game thread which runs all the Lua scripts which control the majority of the game logic and in game objects, the render thread which takes the output of the game thread and converts it to a format that can be fed to the SPUs which then build the graphics command buffer and feed it to the GPU in parallel, plus the physics thread which manages all the physics processing - fluid dynamics, rigid body collisions, everything. John had built the physics system himself and it was pretty impressive (and complicated). It was too much to grok initially, so all I could do was look at the game’s performance on some of the worst levels, profile it there and see how much I needed to speed it up.

I investigated a few of the levels on the PS3 build and picked two that I thought were the worst. What they told me was that the physics thread was the primary cause of slowdown and so I concentrated my analysis on that thread. The game was running at about 60ms per frame in the worst levels I was testing - about 15 to 20 frames per second. From discussions with Strange Loop, I knew that the physics thread had to run at 60fps so this meant that I needed a factor of 4 speed up. From experience, I knew that this was something that was achievable. The game thread was running at around 30ms per frame and the Render thread was clocking in at around 5ms per frame. SPU rendering was pretty fast, and I figured that wouldn't be much of a problem.

Months of fannying about took place and a contract was finally signed in June. So I did the only appropriate thing one could do in that situation - I did two weeks of work on Vessel and then went on a 6 week family holiday to the UK.




In anticipation of the Vessel contract I’d hired an experienced programmer, Ben Driehuis. Ben came highly recommended and had worked on the Bioshock series of games, plus, most importantly, he was living in the same city as me. Having spent the last 6 years working remotely I knew that my best bet of delivering this game on time was to be in the same physical space as my team. I threw Ben in the deep end. He was hired and then given work on platforms and engines that he’d never had to deal with before. Then when I buggered off to the UK he had to manage by himself, dealing with a performance contract on a big name title and then kicking on with Vessel in my absence. Poor bugger.

I also had a friend who was interested in the QA side of things, so James (an old Uni mate) was responsible for playing Vessel over and over and over again, reporting bugs to us and telling us that the performance improvements we’d just made hadn't made any difference. James also helped us out on the testing for the TRC (Technical Requirements Checklist - a massive list of tests that need to be checked off in order to get your game through submission to Sony) and the actual submission of the game.

Once I got back from my holiday we moved into the new office and started working in earnest on the Vessel port. Our goal was to finish Vessel by the end of August - we had about four man-months to do the work, and I had carefully calculated how much time each section of the port would take. Surely that was enough time?

Ben and I divided up the work between us: I took on the optimisation of the physics and Ben was to deal with pretty much everything else - audio was broken, some of the in game objects didn't animate, loading needed to be optimised, memory usage had to be reduced, we needed save games and trophies and a whole swathe of PS3 specific features implemented. Ben has written about some of the fun he had during the port here, here and here.

From my initial inspection of the code, I had noticed a lot of memory allocations taking place, so I was optimistic that I could reduce (or remove) these and scrape back a lot of frame time. I also figured that rearranging the memory usage to be more cache friendly would help the PlayStation 3’s main processor so I could leave the core of the physics code alone. The last thing I wanted to do was to port all of the physics to SPU.

Three weeks later I started porting the physics to SPU. The changes I was making were incrementally improving the performance, but I could see that we’d never hit the speed we needed without dramatic changes. Porting code to SPU isn’t like porting code from one platform to another. You can program SPUs in C++, but they have very limited memory available to them - only 256kb each. So, in order to port any significantly complex task, you need to understand how that task uses memory - where it is, what order it reads it, where it writes it - the things that most programmers of high level languages take for granted. Not only that, but you have to consider how the task will parallelise. Running on one SPU is good, but running on 6 is awesome.  (I’ll go into more detail on the SPU porting process in later articles.)

The first SPU task I built was one that looked for drops that were near a set of objects. My first shot at this shaved almost 2ms off the frame time - the function in question dropped from a peak of 2.4ms to 0.6ms! That meant that in order to hit 16ms per frame I just needed to do this 21 more times! 


Still, Ben and I kept hammering away at the code. By the end of the first week in August I’d shaved 15ms off the Physics thread and yet that thread remained the bottleneck. Ben had freed up over 200MB of memory and we were close to running on a retail PS3. But we’d started to notice a few things - some of the levers and triggers in the game weren’t working, fluid shaders weren’t working properly, we were getting spikes of over 200ms in some cases and the game still wasn’t playable due to not just the levers not working but the AI of the fluros seemed to be broken (some would teleport away).

It was evident at this point that we weren’t going to hit the end of August deadline. Other projects had taken some time from both Ben and myself, so we still had billable time available going into September, so I let Strange Loop know that things were going to take a little longer than I had expected. They were very understanding and comfortable with the release date pushing out a little more.

Things were progressing well though - by the end of August the physics thread was running at better than 33ms per frame - we were over halfway there! Save games were in place, loading was optimised, and trophies were functioning. Unfortunately, the game thread was slowing down - as Ben was fixing problems with the code and data, more things were being processed. For example, audio wasn’t working before and now there was a lot of audio being triggered so this obviously took more processing time.

It was the second week in September when I realised that I’d made a horribly flawed assumption about the game thread in Vessel. I discovered that it had to function in lock step with the physics thread and so it too had to run at 16ms per frame. Reducing the framerate of the physics wasn’t an option as it was fixed to 16ms per frame - experimentation with varying that resulted in fluid collisions breaking horribly.

So, at that point we had the physics running at sub-30ms per frame, and the game thread running at 30ms and above. We were close, but we still had a long way to go - we now had to figure out how to optimise the Lua heavy game thread down to 16ms per frame. Actually, it had to be faster than that - as the PS3 has only two hardware threads, the 3 software threads have to share those two hardware threads which meant that the game thread plus the render thread had to be faster than 16ms per frame.

Then Ben discovered the reason for the teleporting fluros - a compiler bug triggered only in optimised builds was causing fluros to incorrectly calculate their trajectory during jumps, making them disappear. Turning off optimisation for that function fixed the problem. That was a tricky bug to find, so we were pretty stoked - until we realised that this meant that the number of fluros in view was now higher that it was before. Much higher. More fluros means more processing - more physics, more drops, more rendering, more audio. This one fix resulted in the Physics thread jumping back up to 50ms from around 24ms, the render thread jumping up to 12ms and the Game thread climbing to 40ms.

I was horrified. We were effectively back to where we had started. We now needed to save a total of 60ms over three threads. I wasn't sure we could do it. It was like that bit in horror movies where the protagonist has killed the evil monster, only to have it rise up, stronger than it was before.


The initial optimisations on a project are the easiest. As time goes on and all the low hanging fruit has been picked, the remaining optimisations are harder and harder to perform. We really had our work cut out for us - and we'd come too far to go back now.

Visualisation of the code and data changes made for the Vessel PS3 port.

Click here to read on...

tony@overbyte.com.au (Tony Albrecht) Programming Thu, 27 Feb 2014 23:11:19 +1030
Official Vessel Release Date! http://overbyte.com.au/index.php/overbyte-blog/entry/official-vessel-release-date http://overbyte.com.au/index.php/overbyte-blog/entry/official-vessel-release-date Overbyte and Strange Loop Games are proud to announce that the PlayStation 3 version of Vessel is due for release on 11 March in the US regions and 12 March for the European and Australasian regions. The official announcement can be found on the PlayStation Blog


So, for US$9.99 (or the equivalent in your local dollarcoins) you get over 10 hours of puzzley platforming goodness with a fluid system built just for you, lovingly ported to the PlayStation 3 by your caring friends at Overbyte.

But wait! That's not all - if you order through here, you not only get the PlayStation 3 version but the Steam version as well and, (while available) you will also get the Turbine toy. How cool is that?

 We're really happy to be finally releasing this game, and hope that you'll have as much fun playing through it on the PlayStation as we will. 


tony@overbyte.com.au (Tony Albrecht) News Thu, 27 Feb 2014 09:15:37 +1030
Looking For a Good Sort http://overbyte.com.au/index.php/overbyte-blog/entry/looking-for-a-good-sort http://overbyte.com.au/index.php/overbyte-blog/entry/looking-for-a-good-sort Vessel for PlayStation 3 has finally been submitted to Sony for approval so I have a little more time to write up some of the optimisation challenges we had to overcome during its porting process. Well, OK, I don’t really have any more time but I’ll try to write a bit more about it before wet bit rot sets in and I forget everything.


Vessel ran predominantly on three threads; one for Physics, one for Game and one for the Render process. The Render thread translated the output from the Game thread into a command buffer that could be parsed by the SPUs and as such was pretty short - we had allocated up to 3 ms per frame for it and unfortunately, in certain places it was hitting 11ms.

Looking at a Tuner profile we found the following


In this case, the Render thread was taking around 6ms, with half of this taken up with the generation of the quads for each of the fluid mesh maps. Of that, 1 millisecond was spent in the STL sort function. This seemed a little excessive to me, so I looked closer. The red bars in the image above correspond to the following code:


with the following functors to do the gritty part of the sorting.


Basically, what the above code does is, for each fluid mesh map (the outer loop code isn’t shown but the first code fragment is called once per FluidMeshMap) sort the FluidVertexes by water type (lava, water, glow goo etc) and then by whether or not it is an isolated drop (both edges in the g_Mesh are equal) or if it is an edge (e1 != e2). Once that is done this information is used to populate a vertex array (via calls to SetEdge()) which is in turn eventually passed off to the GPU.

From looking at the Tuner scan we can see that the blue parts take up a fair amount of time (they correspond to PROFILE_SCOPE(sort) in the code) and correspond to the standard sort() function. The question is, can we write a better sort than std::sort?

std::sort is already highly optimised and was written by smart people that have produced a generic solution that anyone can plug into their code and use without thinking too much about it. And that is exactly why we can write a faster sort than std::sort for this case - because we know exactly what we want. We have the input and we know the output we want, all we need to do is work out the fiddly transformation bit in the middle.

The output we desire is the FluidVerts ordered by sort and then by edge/drop type. The final order should look something like this:


I decided to base the solution loosely on the Radix sort. We do a first pass over the data to be sorted and count up how many of each fluid ‘sort’ there is. This is then effectively a histogram of the different fluid types - we now know how many of each type there is and we can set up a destination array that is the size of the sorted elements and we can calculate where in that contiguous array the ‘sorts’ of fluid should go.

Our second pass steps through the source array and places the FluidVertex into the correct bucket (based on fluid ‘sort’), but adds it to the beginning of the bucket if it is an edge and at the end if it is a drop. And that’s it. Done. Two passes. A super fast, super simple sort that does just what we need.

“So, how much faster is it then?” I hear you ask.

“About five times faster.” I would respond. This sort runs in about .25ms compared to the original 1ms but it also has the added benefit of moving the FluidVertex data into contiguous sorted arrays (the original just produced an index map that could be used to traverse the original array (vSorted)). This is good because we can then easily optimise the second part of the FluidMeshMap loop which copies the verts to a vertex buffer for rendering and also modifies the edges via calls to SetEdge() (for those that are interested, the SetEdge() function does some simple arithmetic and inconveniently calls Atan2() to set up the angle of the edge for the GPU to use). This contiguous data means fewer cache misses, but on PS3 it also means that we can easily pass it off to the SPUs to do the transform and copy for us while we process the next FluidMeshMap on the PPU. Nice huh?

The result is that the BuildFluidMeshmap section is reduced from 2.25ms down to 0.24ms - a saving of 2ms


The key message to take away from this is that generic implementations of algorithms are great for getting your code up and running quickly, but don’t expect them to be lightning fast in all cases. Think about what you are doing, reduce you problem to the simplest inputs and outputs and then build a simple transformation that does just that. It will make you feel good on the inside. Simplicity is its own reward.

(For anyone interested in more sorting goodness, check out an old blog post of mine A Questions of Sorts.) 

tony@overbyte.com.au (Tony Albrecht) Programming Thu, 30 Jan 2014 12:13:26 +1030
Vessel: Common Performance Issues http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-common-performance-issues http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-common-performance-issues It is time for me to write my final blog on the adventure that was, ‘Vessel PS3’, since my time at Overbyte is coming to an end. This project was tough, however, it is one I am proud we have conquered. There are a lot of interesting and cool optimisations that I myself have done whilst on this project, but since I won’t have the time to write a few blogs about them all, I thought I would highlight a few of the common things I found with some examples of things we did. I will leave some of the more interesting ones for Tony to cover later.

Using containers badly

There were quite a few places in Vessel in which we had to fix up the use of vectors, lists and other similar containers. Obviously you need to use the right ones, to do the right job by understanding what your code needs your container to do and then picking the right one.

But what if you choose the right container or a near enough one? There are still many performance traps you can quickly run into.

Let’s do a quick example. What is wrong with the following code?

vector<PipeEffectDefinitionItem> defs = m_EffectDefinition.GetRef()->m_Definitions;

for(int i=0; i<m_vfxInstances.size(); ++i)


   PipeEffectVfxReference& vfx = m_vfxInstances[i];



Did anyone notice the problem? We are grabbing a vector of these PipeEffectDefinitionItems from another object, but instead of using a reference, we are copying them. This happens in an update code of a particular class, which can be called a few times a frame due to other interactions. In each of them it was doing the same thing. Sometimes it is valid to copy a container, if you know the data might change or it is temporary. However in this case, we are referencing static data, so changing the first line to vector<PipeEffectDefinitionItem>& means we don’t need to allocate a whole lot of memory.

Memory allocation and de-allocation was a huge problem for us, with all the threads fighting each other. I have seen many code bases suffer from the same problems in my career. The things we found ourselves doing often in Vessel, was to attempt to work out the size of containers outside a loop, instead of just adding items to them inside it and causing lots of resizing of the containers.

Caching Data

Another common way to solve performance issues is to cache data. One of Vessel’s biggest function costs on the game thread, was a lookup which searched all persistable items in our game to see if it matched a HashID. The idea behind it being, that since some things are spawned at run time and some during load, they could safely reference each other and this system would look it up and let you access it, if it existed. This would be done throughout the code base for each frame and whilst the lookup itself was quite efficient, it was still expensive.

When dealing with a code base you don’t really know, the hardest thing is learning why things are done the way they are. Tony’s blog a few weeks ago talked about this. You cannot assume that the programmer before was stupid. They wrote the code the way it was for a reason. Sometimes they just wrote it in the most simple way and didn’t have the same needs/performance considerations you do. Sometimes the game/systems around it changed which means either a system is adjusted or the system is used in a way that it was not intended.

In this case it was clear that on PC the performance cost of looking it up each time was minimal to nothing, whereas the effort required to track object lifetimes, caching and removing cache, handling level transition and everything else would result in such an insignificant gain it was never looked at. For PS3 however we needed to cache these lookups! Here is the original call to GetRef.


And the cost of the function over a typical frame:


Caching this value required a few changes and a few assumptions. We knew that all things referenced by this system were all persistable, so we could add our required call back functions for both destruction of the persistable, as well as the removal of the reference to it on that class. We also assumed that the constructor would only be used to create new items, not used to re-assign items. This meant that we knew that we never had to check the cache in constructors. For all of our assumptions we had to write code to catch when we were wrong and adjust lots of other code around these. But in the end, what we were able to do is change GetRef to:


And now looking at its performance, we see a 0.5ms improvement.


Inefficient Data Searching

The first thing you should probably do in a code base when looking to optimise is to search for something like TODO: Make faster. Chances are it is probably slow. Whilst I am kidding, I did find one such comment after noticing a function using a lot of time via hierarchy view. This function was used to find both functions and properties so that lua and the game code could talk to one another. Here are the pieces of code in question.


There are two inefficient things in this code. Firstly we are using a vector to find the functors and properties, where we never need to iterate through the list, well at least not in any time critical, runtime code. Secondly we are doing a string comparison, ignoring case. These aren’t cheap either.  Here is a summary of a profile of this code running in one of our levels.


The solution here it two fold. Firstly we should be using an associated container as we know that

a)      The container is only added to as a start up cost, which sets up the links between the strings and the actual functors/properties.

b)      The time critical part of this function is a lookup based on a key

c)       We almost never do iteration through all elements (except to find the key as above)

Secondly we know that the string compare is not fast. In a case like this you want to convert it to a hash which will be a lot faster. The Vessel code base already uses hashes instead of strings in most places, but this one was never changed. So we now have the follow code:


The difference we can see below, a small saving, but the key thing in the max is reduced, so less peak slow downs. It is always good to review any functions which are called a lot, especially if they seem to be taking a lot of time in comparison to the work they should be doing.


Doing things unnecessarily

If you don’t understand a codebase, like in our case when we started Vessel, knowing what each function and system is trying to do and stopping it from doing things it doesn’t need to do can be extremely difficult. Now that we are at the end of the project, we are pretty familiar with the code and thus have found many areas where the code was doing things it didn’t have to, or being overly complicated.

For instance in Vessel there are objects called Water Instances. There is one water instance for each specification of liquid in the game. There could be several specifications of Water or Lava. The problem was the water instance code also managed sounds and lights for the liquids based on flow, collisions and other information. This code would find all liquid of a liquid type (Water, Lava, Red Goo) and play/stop sounds and generate lights for them. As we could have up to 9 water instances of a single type, this meant the same liquid was processed 9 times and 9 x lights and sounds were being played.

Obviously the issue here was that originally water instances and water types were a 1 to 1 relationship, but as the game grew, this moved apart and as there was no performance impact on PC, it wasn’t revisited. But on PS3 this was causing us huge issues, going through all the liquid in the game is slow, so to do it 9x more then necessary was bad. Also fmod was not appreciating the heavy load it was being put under and the GPU was struggling drawing 9 lights in the same location. Here is the profile:


We simply just cached if we had done the lighting and sounds checks for a liquid type each frame. If we had, we didn’t do it again. This did have some repercussions, as the content as all tweaked around this. So we had to increase sound volumes and some other settings, and the intensity of lights.

Further inside this class it was handling collisions and keeping a list of data about the interactions between drops of the fluid. This list was reusing a pool of data, to stop memory allocations and de-allocations. The list was checked each frame to remove any which were now obsolete and then checked for room to add new records later.

This code again was quite slow, as it was going through 100 records per 35 water instances. And most of the time, as most water instances do not have liquid on screen, there was no records to clean up. However the code did not know this and continued to check each empty record. Thus we did a minor optimisation to this, making the object know if it had any records being used. This way the code did nothing for the majority of the water instances.

This was a much smaller saving, but an indication of some of the real simple changes you can make to stop code doing things when it doesn’t need to. Let’s have a final look at the water collisions:



Anyway I hope there was some interesting information in there for you. Thankyou to everyone who took the time to read one of my blogs. Tony has heaps more in store for you, so watch this space!

ben@overbyte.com.au (Ben Driehuis) Programming Tue, 31 Dec 2013 09:33:04 +1030
Vessel: Where is the memory going? http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-where-is-the-memory-going http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-where-is-the-memory-going Going into Vessel we (wrongly) assumed that memory usage would be not much of an issue, at least in terms of peak memory usage and getting it running on a console. However we soon learnt that the initial build of Vessel we had could not even load the main menu or any of our smallest levels without running out of memory on a retail PS3.

The first step is to find out what memory we are actually using and where is it going to. This might sound obvious and simple, but depending on your codebase and what engine and middleware you are using, there may not be any easy ways to find this out. With Vessel using its own custom engine written for a platform where memory isn’t a great concern, we had no tools to start with.

The PS3 has two areas of memory to use, 256mb of main system XDR memory and 256mb of RSX memory. This memory is completely separate from each other, with the Cell CPU being able to access the system memory fast, whilst the GPU accesses the RSX fast. Both processors could access the other area of memory but at a performance cost. Some good information on this can be found here, a few paragraphs down: http://www.playstationlifestyle.net/2013/02/27/ps4-memory-analyzed/


So we started trying to track down our usage. We used our RSX allocator to track current and peak memory usage and used sony api calls to track our system memory usage. What we soon learnt got us a little worried. We were way over memory on system (383mb used on our smallest level!) and running out of memory on RSX for most levels.

Middleware wise, we use FMOD, Lua, Boost and TinyXML. FMOD has some easy built in API to get memory usage and surprisingly I learnt that it was using 180mb just a minute into our gameplay… Thankfully this was a case of bad data, as the PS3 FMOD project had not been set to stream the music! This reduced it to a more reasonable 80mb, still high, but since all sounds which aren’t music are being loaded into memory, it was at least what we expected. So where is the other 200mb of system memory going?

Whilst we had no internal memory tracking tools, thankfully the PS3 does have a third party solution to tracking down memory called Heap Inspector. This solution developed by Jelle van der Beek and is free for PC and PS3 developers. Setting this up is quite straight forward, it just requires you to wrap your memory allocation and deallocation calls. An easy way to do this, if possible, is to use the link-time wrapping feature (--wrap). You can find the website for Heap Inspector here http://www.heapinspector.com/.


This gave us a very good rundown of exactly where the system memory was going. A big chunk of our memory was being used in the tinyxml system, which seemed strange. Since we were on quite an old version, we decided to upgrade (as there were both significant performance and memory improvements). As this involved a big api change I had to rewrite most of the xml interface code, which actually lead me straight to the major memory problem, after loading we were not clearing out the document handle properly, which meant the file buffer was being kept in memory. Since this was all raw xml, this was between 15-20mb per level, and our global data was another 30-40mb.

So now we can load most of our levels, but we couldn’t transfer between most of them. Further analysis with heap inspector showed us that fmod memory was increasing as the game went on and that a big chunk of our system memory was there to catch the rsx memory overflow. Again the fmod one wasn’t too hard to track down, originally Vessel didn’t have to worry about memory, so it never unloaded any sound data. Even if that sound was never played again. Whilst doing this after it finished playing would be bad and cost us performance (as when the sound is played again it would have to be loaded), on level transfer it was safe to clear our all the fmod event data. This meant it wouldn’t grow as the game went on.


Now we needed to solve the rsx overflow and looking at the tracking we had added ourselves, all of that overflow was due to texture data. Most games have texture streaming and a texture pool, but Vessel does not and just loads all level data. The levels are smallish though so we shouldn’t be using so much RSX! I soon discovered we were only overflowing on level transition and that’s when I worked out what was happening. On transition, we first load in the new level and data and then went through the old data and only unloaded anything that wasn’t reused. This is a nice optimisation for load times when you have memory to spare, but we didn’t have that luxury. Was there a way I could keep this in and not overrun our memory?

The good news is that the actual binary data and the data object were separated and thus what I did is change the system so it queued up the loading of the actual textures, but still loaded in the normal data objects and thus could still compare if we needed any of the texture and other asset data. Once we had unloaded all the not needed data, we then loaded in the new textures. Now we never went over RSX memory and we could get rid of our overflow!


The final memory issue we had was that sometimes loading our largest level we would run out of memory. This was because again it was using XML and the way the serialization system worked is that it read it exactly as an xml file. Thus we needed to allocate the file size buffer. Not always possible! So we had a few choices, change the serialization system so it could stream it in or find a way to reduce the file size.

Since TinyXML supported both compressed and binary xml data, which greatly reduces the size (as well as splitting the data over two files, instead of one large one), we decided to do that first. This reduced our largest level from needing a 21mb buffer to needed a 7mb and 3mb buffer. That was a lot more manageable and turned out to be more than enough for our needs. The nicer way to go would be to stream the data in and not need the buffer at all, however in our case we needed the fastest and least intrusive (since we need the PC version to work so we could access the editor).

So those were the major issues we had with memory usage. A lot of them are very specific to the issues we ran into, but the key things to remember are:

  1. You need to find exactly where the memory is going
  2. Use tools where possible, you don’t always have to re-invent the wheel
  3. For middleware, check for API calls, check updates and talk to their support
  4. Never ever trust the content!


ben@overbyte.com.au (Ben Driehuis) Programming Fri, 13 Dec 2013 10:18:00 +1030
How Smart Programmers Write Stupid Code http://overbyte.com.au/index.php/overbyte-blog/entry/how-smart-programmers-write-stupid-code http://overbyte.com.au/index.php/overbyte-blog/entry/how-smart-programmers-write-stupid-code This is article is based on the presentation I did at GCAP 2013. You can find the slides in Keynote and Powerpoint formats here. The slides include extensive presenter notes, so you should be able to understand what I was getting at even though the slides are purposely light on content.

I've spent a significant part of my professional career working with other people’s code. I've worked with code from big studios and small ones, from successful ones to struggling ones, with experienced devs and green. I’ve seen code that is beautiful, code that inspires as well as code that makes no sense, code that stinks, code that is impossible. I've seen code you wouldn't believe. 


But all the code I’ve seen, both good and bad, has one thing in common - it was written the way it is for a reason. Part of my job, and the job of any code maintainer, is to find that reason and understand why the code was written in that particular way. 

When confronted with new, complex code, there are 4 stages that a programmer goes through:

  • [horror] WTF is this?
  • [confusion] How did this ever work?
  • [blame] I hate you and I wish you would die.
  • [enlightenment] Ahh, I get it now.

The forth stage is the important one. Enlightenment. Comprehension. When you finally understand the code completely and it becomes your own. All too often if the function of some code isn't obvious, then it is often dismissed as being stupid - but this is rarely the case. The code in question isn't necessarily stupid, but it is obscure. Complex. Some of the smartest people I know are programmers, yet they can write some bloody awful code. Stupid coders don’t last long in this industry and their code doesn't last much longer than they do. So why do smart programmers write stupid code?

There are two ways of constructing a software design: 
One way is to make it so simple that there are obviously no deficiencies
and the other way is to make it so complicated that there are no obvious deficiencies.

 — C.A.R. Hoare, The 1980 ACM Turing Award Lecture

Time Constraints

Time constraints have a significant impact on the complexity of our codebases. There are two types of time constraints: the time available to write the code in the first place as well as the time available for the code to complete execution in. 


Time to get the code running

The first constraint is a very common one - game devs are time poor. We generally have only just enough time to get the code working in the first place - who has time to revisit their code just to clean it up? We've got a game to ship, right? Personally, I’ve seen some awful game engines ship great games - and conversely, great engines that have never shipped a game. So, back to my point; given a limit in time to develop the code, what is often produced isn't always the cleanest or simplest solution. When that code is revisited for bug fixing or some other modification, its opacity can mean that it won’t be touched - rather it will be worked around or removed completely as the new coder may not understand why or how it has been done.

Time for the code to run in.

The second constraint is one we've all seen too. A portion of code was recognised to be a bottleneck and so it has been optimised into something unrecognisable. In its optimised state it is difficult to debug, difficult to add to, difficult to understand. 

Both these forms of time constraint can result in complex, hard to maintain code. ‘Stupid’ code. Code that wasn't necessarily written by stupid people, but taken at face value is at best, obtuse.

Code Entropy

Code is like a thermodynamic system. Over time it tends to a state of disorder. In fact, given enough time it tends to a state of perfect internal disorder (which is a very apt description of most shipped games’ codebases). Disorder within a codebase manifests as complexity (and instability) which, if not actively resisted, increases over time. 


There are a couple of factors which contribute to this increase of complexity;

Problem Domain Evolution

The problem you end up solving is rarely the one you set out to solve.

Consider a programmer who sets out to solve a problem. She settles on an appropriate approach an initial solution is implemented. At a later date the initial problem is deemed to be incorrect or requires incremental modifications and so you she has to readdress the problem. The new direction is similar enough to the old one that it makes sense to simply modify the existing code - so she slowly twists the original solution to fit the new problem. It works, it’s a little ugly, but it works. If this happens enough times, the final problem can be significantly different to the original one and the initial design may not be as appropriate any more. Or even remotely appropriate. But it works. 



The issue here is the evolution of the code. The programmer has designed a clean system to solve the given problem and that problem has changed so she needed to modify, or dirty, the system. Edge cases and singularities can pop up which can dirty the implementation as well.

So what you have is an evolving system which gets more and more complicated as the problem domain changes. This is even more evident over longer periods of time leaving you with dead code and data. As new programmers are swapped in and out to iterate on the code things get progressively worse.

 Solution Evolution

Iterating without readdressing the initial design convolutes the eventual solution.

A programmer starts with a given problem. She designs a system to solve this problem and then implements that solution and tests it to see if it solves the problem. If it does, then the problem is solved, the code is checked in and beer is consumed in a celebratory manner. If not, then the implementation is revisited and iterated upon until a working solution has been achieved (possibly while everyone else is at the pub having beers). 



The issue here is that the initial design may not have been the most appropriate for this problem - the programmer learns more about the problem as she implements and iterates on it, but rarely revisits the initial approach to the problem. The more work that is put into a particular approach, the less likely the programmer is to change it. The ideal time to design the code for this problem is once she’s finished implementing and testing it.

What we end up with is sort of like DNA - there is a working solution there, but there is also a lot of cruft left in there from other iterations which has no effect on execution (except for when it does). It’s hard to tell what is part of the core solution and what is cruft - so what happens is nothing is touched when changes need to be made. New code is added to the periphery of the solution, often duplicating what is already in that original solution.

Complexity has momentum

The problem with complexity is that it breeds more complexity. The more complex a solution is, the more complex modifications to it become, either through necessity (the complex implementation requires complex changes) or convenience (it’s too hard to understand so its easier to tack something onto the existing solution or even to rewrite what is there). 


Controlling complexity is the essence of computer programming.
    — Brian Kernighan 


It takes a real effort to simplify a complex implementation - a coder needs to completely understand the existing solution before she can simplify it. The urge to rewrite a complex system is strong, but unless you fully understand the existing implementation you run the risk of making the same mistakes the original programmer(s) did.


Any intelligent fool can make things bigger, more complex, and more violent.
It takes a touch of genius — and a lot of courage — to move in the opposite direction.



In closing

My recommendations are this; Treat existing code and previous coders with respect. In all likelihood it is the way it is for a valid reason. Once you understand why the code was written the way it was, you are then in a position to modify, simplify or optimise it - and when you do, leave the code in a state which is better than when you found it. Your future self with thank you.

Where possible, take the time to refactor your code once it is working. Simplify it, clean it, make it beautiful. Beautiful code is its own reward. 


tony@overbyte.com.au (Tony Albrecht) Programming Thu, 21 Nov 2013 21:16:57 +1030
Vessel: Lua Optimization and Memory Usage http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-lua-optimization-and-memory-usage http://overbyte.com.au/index.php/overbyte-blog/entry/vessel-lua-optimization-and-memory-usage Hi everyone, my name is Ben Driehuis and I am a senior programmer at Overbyte.

Vessel uses Lua to do it's glue gameplay code, allowing it to use most of the systems and tech built up throughout to form specific gameplay related classes. When we first looked into the performance of Lua and Vessel in general, we noticed very quickly that all of the threads were doing a lot of memory allocation and frees each frame which meant there was actually a lot of contention between the threads.

Lua has its own incremental garbage collector and is performed automatically. The collector runs in small steps and run time is more or less based on the amount of memory allocated. Basically the more memory you are assigning in Lua, the more the garbage collector will run. You can also manually run the garbage collector in many ways, forcing a step or doing a full run (which is not cheap). The full run through is very handy for the cases where the user won’t notice an extra 10-30ms in a frame - a good example of this are at the start/end of a level load or when showing a pause menu or some other game pausing interface.

On console memory is always a huge factor, with most optimizations involving some trade-off between memory and performance. However in a lot of cases the trade-off is heavily skewed one way or the other. In vessel our garbage collection time was quite busy. We were forcing one step of the GC per frame, after our gameplay thread had done the work needed for the physics thread to do its thing.  The result of that in this particular scene was:


An average time of 2.6ms is pretty good, but the maximum of 7.4ms is a huge concern! As we learnt above, the step time is very dependent on the amount of new memory being used each frame in Lua. In addition to the manual GC we have Lua doing its normal background incremental garbage collection during execution. From the below you can see it working away during the game thread, indicated by the calls to propagatemark (See Tony's last article, A Profiling Primer, for more info on Tuner)


Looking at our markers we quickly noticed that our processor Lua scripts were also quite expensive in this scene (the two pink bars in the above image). This marker is the main place where our gameplay world objects which change each frame exist. Thus it is also the part where the most Lua is called. Let’s have a look at the script processor update performance currently. Looking at the image below, we can see it has an average of 7.1 ms. That is not good at all!


We then started investigating what short term allocations we were making inside of Lua to try and reduce our need to do the garbage collection. Our aim was to turn off the automatic background garbage collection and just run our step at the end of each frame, with that step not taking too much time. We used third party code from Matt Ellison, who is Technical Director at Big Ant Studios to track down and display the memory we were allocating and freeing each frame. A big thankyou to Matt for allowing us to use this.

So, in this scene we determined we had about 30kb per frame in temporary Lua allocations. That’s a lot and will quickly add up! Looking at the output from Matt's tool, we saw that 20kb of that allocation came from just one Lua script, Bellows.lua,  and all of that inside its SetPercent function. The output showed a lot of temporary Vector2 construction, including a few kb from our Utils.lua VectorRotate function called inside it.

Let’s take a quick look at those two functions.



Well I think we can now see where all that Vector2 construction is coming from! Inside that for loop, we are creating three Vector2’s per iteration and then the call to VectorRotate in the loop creates another one. So that’s five per loop through on each bellow.

We can also see here there are some unnecessary calculations. Bellow.GetRotation + this.OffsetRotation is done twice without either of those value changing in each loop. We also know that GetRotation is a call into native code which is more expensive then accessing a Lua member. Material Scale is also calculated in each loop, even though it never changes value. So instead of creating all of these Vector2’s inside the for loop, lets use local variables and remove the call to the VectorRotate. This gives us:



A good performance improvement on average and looking at the temporary memory use of Lua we now saw a maximum of 23kb a frame, a 7kb saving and an average of 20kb. Our output was now not showing any Vector2 allocations, we had reduced our memory waste by just under 25% and, as we hoped, our Lua was running faster.

We still have some temp memory each frame created for each bellow and ring which we can make into a local for the function. We can also store the reference to the maths functions as locals. This will give us a small speed up in performance.

Why is this? Because Lua allows each function to use up to 250 of its own “registers”, which means Lua stores all local variables into these registers. For non-locals, it first has to get the data/function and put it into a local and then use it. If you are calling or creating something once in a function, obviously making it a local will not help as you have the one time lookup cost. But in a loop, you are saving each subsequent call! So let’s run with the following code:



A huge performance win! Memory wise though we aren’t really seeing any big improvement, still using a maximum of 22kb of temporary memory a frame and an average of 19kb. At this point I decided to find other areas heavy in bellow usage to see if this would assist me in narrowing down where all the waste is.

This brought me to the start of one our levels, where the camera zooms out over a largish area. This was showing an average of 20kb a frame with a maximum of a massive 38kb! And 17kb of that was again the Bellows. The data showed lots of memory use by calls to GetMaterial and GetLocalRect. Performance wise for our ScriptOnProcessorUpdate, we were pretty close to the above image in terms of average, at 4.33ms, but with a higher maximum of a massive 8.6ms and lower minimum at 2ms.

Looking at the code we now have we can see two interesting facts regarding the use of GetMaterial and GetLocalRect. Firstly both are called on unique items once in the loop, so we can’t cache it. However both are used to then call a function on them. That’s two calls into our native code and a temporary allocation of an object pointer and a Rect. So what happens if we make two new functions in native code to let us do it in one call, without the temporary assignment?



Another good win due to less calls between Lua and the native code and less temporary allocation. We now have an average of 14kb of temporary memory usage per frame and a maximum of 20kb for this new scene. A whole lot better! Our bellows here are now allocating a maximum of 9kb.

We can also remove the offsetVector, as we know the X component is always 0 and so we can cut down the position calculations, as we know doing mathcos(bellowOffSetRotation) * offsetVector.x will give us a result of 0, so the mathcos is a waste of time. Thus those position calculations can be simplified to:

position.x = posBase.x - (mathsin(bellowOffSetRotation) * posOffset)
position.y = posBase.y + (mathcos(bellowOffSetRotation) * posOffset)

This doesn’t really save us much temporary memory as we are just saving on one float, but it is still a slight help and the performance is slightly better. At this stage we are getting pretty close to an optimised function without changing things significantly. So let’s take a relook at the performance of the garbage collection step at the end of each frame now.


So our average is down by 0.5ms and the maximum is lower too after dealing with just one of the lua classes. Add that to the 3.5ms saving we got in the processor update itself, which some of it will also be due to garbage collection savings as we are still running automatic background GC, that’s a huge saving on the game thread.

There are a lot more areas like this in the code base where we should get similar wins and will greatly reduce our lua memory churn. The golden rules to remember are;

  • use locals when you use something more than once a function
  • try to cache of data from native code if you use it more than once
  • make condensed functions to again reduce the calls to native code. Calling one native function which does all the work is much better then calling several native functions to do the same thing.

The next step is to turn off the background garbage collection and only run it at the end of the frame for a certain amount of ms. This will require us to keep an eye on the memory usage of lua and ensure we are doing enough cleaning up to mean we can get away with limiting the garbage collection time.

There is more information about Lua performance in the article titled Lua Performance Tips by Roberto Lerusalimschy (http://www.lua.org/gems/sample.pdf).

Once again, a huge thanks to Matt Ellison and Big Ant Studios for letting us use their Lua memory tracer.

ben@overbyte.com.au (Ben Driehuis) Programming Wed, 02 Oct 2013 08:55:00 +0930
A Profiling Primer http://overbyte.com.au/index.php/overbyte-blog/entry/a-profiling-primer http://overbyte.com.au/index.php/overbyte-blog/entry/a-profiling-primer The most interesting thing about this port is the optimisation feat ahead of us. Vessel was already (mostly) running on the PS3 at the start, but needed to run much, much faster. So, where does one begin? The first thing you need to do, is profile it - if you don't know where the program is slow (or why) then you can't optimise it. But in order for me to be able to talk about profiling and optimisation, I'll need to delve (briefly) into the profiling tools on the PS3 (If you're an experienced PS3 dev, then your probably familiar with this information - skip ahead, I won't mind). Here's a screen capture of part of the tool I use for PlayStation3 profiling (SN System's excellent Tuner - comfortably the best profiler I've used)


Like all the best profilers, it has lots of colourful bars (the PS2 Performance Analyser was awesome like that too). Colourful bars are crucial for dazzling and confusing artists who may wander by and happen to glance at your monitor. The top of the screen has a time line in seconds - you can see it ends at 0.073, or 73ms. Yup, that's right, 73ms. (Note that in this blog I will almost always use milliseconds rather than frame per second. Frames per second is a useless metric only used by teenage boys who like to put lights inside their PCs).

The display is broken up into roughly 3 parts that we can see here. Each group of horizontal bars corresponds to a thread (click on the image to see it in higher resolution in another window). The top thread is the Physics thread, the next the Game thread and then the Render thread. 


The top, segmented bar is the overall thread usage. When the thread isn't running (like at the start during the stall for sync) you'll see a gap or a bar indicating an OS function (like a mutex or even fileIO). The colourful collection of bars beneath that are the result of marking up important functions in the code base with calls that tell the profiler when the function starts and stops. The longer the bar, the more time it takes. Bars that appear under other bars indicate function calls from that parent. For example, WaterTime calls MoveWaterFromInternalPressure which calls ApplyViscosity and the functions that call DoubleDensityRelaxation and then ProcessSkeletons. It is a very powerful way to view the flow of your code and I often find myself using Tuner just to see how a foreign codebase flows. For those of you out there who don't have access to the PS3 devkits or Tuner, you can use Chrome Tracing to read the output of your profiling in JSON format and display it in your browser (assuming of course, that your browser is Chrome).

So, basically, the longer the bar the longer the function it represents takes to run. Now, how do we choose which bar to optimise?

Generally, when looking for what to optimise first, I try and find something easy, something obvious. Occasionally you'll get lucky and make a simple fix that will save milliseconds - check your compilation options (you wouldn't believe how often people miss one library/project), check that your code is inlining properly (check the generated assembly), look for lots of string based functions as well as intra frame allocations - basically make sure that everything you're seeing makes sense and that there are no piles of stupid lying around. There are two ideal candidates for optimisation: One function that does a lot of processing or one or more small functions that are called a lot.

There is another feature of Tuner (and most other good profilers) and that is Program Counter sampling. Program Counter (PC) sampling is done by the profiler reading the program counter register to get an indication of which line of code is being executed at that time. The more times a particular line is sampled, the more often it is being executed - on PC the free Very Sleepy does a good job of this. You'll either get a hierarchical view of the functions being called, or a flat view of what is being called when - like in the following image:


The red and blue marks correspond to samples that are performed during the frame. On the left you can see the names of the functions that the samples have occurred in. In Tuner you can also drill down into the actual functions and see which lines and even which assembly instructions have samples attributed to them. This allows you to find hotspots - sections of the code that are called a lot and contribute heavily to the cost of each frame.

Tuner has a lot of additional functionality that I won't go into here - this should be enough to help you to understand captures that myself and Ben will post over the upcoming weeks.

In my next post I'll be looking in detail at the optimisation of one of the sections of the Physics thread. I'll just say this; my initial hope was that I could gain a lot of performance back by simply rearranging memory and minimising memory allocation in the physics code. That hope was very quickly and resoundingly dashed - this wasn't going to be an easy task.

tony@overbyte.com.au (Tony Albrecht) Programming Fri, 27 Sep 2013 09:15:00 +0930
Any Port in a Storm http://overbyte.com.au/index.php/overbyte-blog/entry/any-port-in-a-storm http://overbyte.com.au/index.php/overbyte-blog/entry/any-port-in-a-storm The Overbyte team has been busy beavering away on a number of projects over the last few months, but there is one that I thought might be of interest to a few programmers out there. We recently started work on a PlayStation 3 port of the successful Indie PC title, Vessel, and what we plan to do over the coming weeks is to document the porting process. We'll look at what we started with, what our goals were, what assumptions we made, what mistakes we made along the way and have a look at some really gritty optimisation problems.

tony@overbyte.com.au (Tony Albrecht) Programming Sat, 14 Sep 2013 22:21:10 +0930
Optimisation Lesson 3: The Memory Bottleneck http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-3-the-memory-bottleneck http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-3-the-memory-bottleneck

Welcome to Lesson 3 of the Overbyte Optimisation series.

If you cast your mind back to the end of lesson 2, we were left with the following code and corresponding samples:

(Note that the samples here are not exactly the same as the ones in the previous lesson, even though the code is. This is due to a change in the version of the compiler being used – we now have faster, more interesting code)

From looking at the source, there is nothing obviously expensive in the calculations at the highlighted points – no branches, square roots, divides or other non-pipelined instructions. So, there must be something other than instruction latency causing the performance problem.

Lets have a look at the disassembled code and see if that sheds any light:

I’ve highlighted the two biggest samples – almost an order of magnitude larger than the other samples (note that the order of the samples is swapped – this is a good example of the compiler reordering instructions into a more optimal format). From the last lesson, we know that fadds and fmadds are only 10 cycles long, so that shouldn’t cause the large number of hits. Let’s look at the C code in relation to the generated assembly:

In the optimised code the y position is first loaded, followed by the X velocity and Y velocity, then the x velocity is added to the already loaded x component of the acceleration value. It’s this addition that seems to be taking most of the time –  (we’ll look at the reason for the second stall later in this lesson). What is actually happening is that the fadds instruction is stalling, waiting for the x velocity to be loaded into the f10 register as the addition can’t take place until it has the correct data in all the appropriate registers. The large number of hits on this instruction implies that this memory load is taking a very long time indeed.

So what’s involved in loading memory into the CPU registers?

Memory access

In order to load data into a register on the CPU it must be transferred from the memory subsystem. Back in the good old days, memory access times were comparable to instruction times – each in the order of a few cycles. Over the following decades improvements in CPU speeds followed Moore’s law increasing their performance logarithmically while memory access speeds crawled along, improving performance by only 10% per year.

Comparison of relative speeds of processors and memory.


In order to reduce this memory bottleneck, CPU designers added a small amount of very fast, very expensive memory called a cache. These caches come in different flavours – Level 1 cache  is the fastest and smallest and is generally found on chip. It can only pull data from Level 2, which is larger and slower while level 3 cache (if your CPU has one) is larger and slower again. The highest order cache in your system is the one that retrieves data from main memory. Each element in a cache contains not only some data that corresponds to a copy of the data in a location in main memory, but also information on where that data comes from. Each cache can only access data from the level below (or main memory if its is the highest order one). A great explanation of how the different types of caches work can be found here.

In the case we are looking at, the instructions request some data to be loaded into a floating point register. The CPU will first query the L1 cache for that data and if it is already resident there, will return it within a matter of a few cycles. If it is not there then the L2 cache is queried and if it is present there then it will be returned as soon as possible, which in this case  (on the PS3) will be around 30 cycles. If however, the required data isn’t in the level 2 cache, then main memory must be accessed which takes hundreds of cycles (somewhere between 400 and 600 on PS3 and X360). This is an eternity as far as the CPU is concerned – hundreds of instructions can be executed in the time it takes to pull data from main memory.

If you had a single instruction that could take 100s of cycles, how often would you use it?

The following interactive page illustrates the relative times for accessing data from L1, L2 and main memory. Click on the different memory subsystems to see  how slow main memory is. (Page requires HTML5)

Interactive demo of relative cache/memory access times

As you’ve experienced above, retrieving data from main memory is very slow – painfully slow. (In this demo, the latency for the different gaps are; L1$ -> CPU: 4 cycles, L2$ -> L1$: 30 cycles, Main Mem ->L2$: 600 cycles). An L2 cache miss is in the order of 50 times slower than the average instruction latency.

Minimising the memory bottleneck

So, what can we do about it? We need to use memory, so how do we minimise the impact of these long reads from main memory?

The simplest way is to just use less memory – minimise the size of your regularly used (hot) structures. Do you really need to an array of rarely used data (eg char name[32];)  in the struct or can you just maintain a pointer to it? If it is a large structure, can you break it up into hot and cold parts, or at least parts that are used in sequence?

In the particle system case we have here, these aren’t really options (unless you want to drop down to 16bit fixed point arithmetic). Another alternative is to make sure that the data you are accessing is contiguous: each cache load will read in an entire cache line (on the X360/PS3 this is 128 bytes), so you should make sure that all of the data being read is actually being used. Traversing an arbitrarily allocated linked list is a pathologically bad example of this, where each read of the next pointer is most likely a cache miss. Traditional scenetrees are another example of bad cache behaviour, but there are ways around both of these cases (I won’t cover it in this article, but you can see an example of a scene tree optimisation here in an article I wrote a couple of years ago). In the case we’re dealing with here, this won’t help either as all of our data is already in nice flat arrays and nothing is read that we don’t use.

The third optimisation is to use a special instruction that will start fetching some data for us, ideally well before we need it so that when we do need it, it will hopefully be in the L2 cache ready for us. This is called prefetching and, when used effectively, can have a dramatic effect on your code’s performance.


On the PlayStation3, the prefetch instruction is __dcbt(loc), where loc is the location of the data to be prefetched. It triggers a load of a single cache line (128 bytes)  into L2 cache. Let’s add one into our loop and see what happens:

Well, our stalls seem to have vanished, but what about the performance? The original code was running at around 10.8ms – this code now executes in 4.5ms! Nice! We’ve sped up our code by a factor of two by adding an instruction.

I chose (p+10) as the location to prefetch as the inner loop is about 40 cycles and fetching 10 particles ahead corresponds to 10*40 = 400 cycles which is approximately the time taken to fetch data from main memory. The graph below shows the different  times taken for different prefetch distances:

The first column corresponds to the code without the prefetch instruction, while the second column prefetches the particle that we are about to use in that iteration – note this indicates the cost of the prefetch instruction is non-zero, but very soon the effect of prefetching mitigates that cost. At 8 to 10 particles look ahead (192 to 240 bytes) we start to plateau. Now it is worth noting that the __dcbt() instruction loads in 128 bytes at once, so when we’re calling __dcbt() within our loop, we’re actually prefetching the same cacheline 5 times. This is still cheaper than testing if the next particle to be prefetched is in the same cacheline or not (if we unrolled the loop to do 5 particles (5 * sizeof(MyParticle) = 120)  at once we’d get better prefetch distribution (as well as pipeline utilisation), but we’ll look at loop unrolling in a later lesson).

The reason for the dual stalls

The reason for the dual stalls in the original code is due to the size of MyParticle relative to the size of the cache line. When a single particle (MyParticle is 24 bytes in size) is loaded, an entire 128 byte cache line is loaded. This means that 5 particles will be loaded in with that one, as well as part of the next one.  The following diagram shows the distribution of particles over 3 consecutive cache lines (the pattern repeats every 3 cache lines).

The first case where the struct aligns exactly with a cache line results in a hit on the first stall as the x velocity is used. The next 4 particles are all within that cache line and so do not incur another cache miss penalty. The first miss happens loading the 6th particle and contributes to the first stall once again – py is in cache, vx is the first miss.  The next miss contributes to the second stall as vx is already in the cache and the processor has to wait while vy is fetched from main memory. This behaviour matches our observations with approximately 2/3 of the hits being on the first stall and 1/3 on the second. When we added prefetching both of the stalls were removed (as we would expect).


What we’ve seen here is how to identify and fix a simple system which is preforming sub-optimally due to cache misses. The memory used is already contiguous and homogeneous, so prefetch look aheads are easy to implement. More complex memory access patterns can be considerably more difficult to optimise, so I would suggest that one of your first passes at optimising a system (after algorithmic optimisation) is to simplify your memory access. Determining what data is used in a tight loop and then ensuring that data is stored in a cache friendly fashion will not only give you an instant speed boost, but will make it easier to prefetch and thereby optimise further.

When designing a system that needs to be fast, it is critical that you consider the layout of the memory being used. Optimal code is irrelevant if your data is badly organised.

TLDNR: put all your data in flat arrays.

Coming up next

Now that we’ve removed the memory bottleneck, we’ll next have a look at further optimising this code by addressing the instructions used. In particular, we’ll start using some SIMD instructions and see how that impacts our performance.

Until then, remember, it’s all about the memory.

rtaubert@sunshinedc.com.au (Tony Albrecht) Programming Fri, 23 Mar 2012 15:46:00 +1030
Optimisation Lesson 2: The Pipeline http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-2-the-pipeline http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-2-the-pipeline

Before we start on the second in this series of optimisation lessons I’d like to make something clear. The type of optimisation that I am predominantly talking about in these lessons is low-level optimisation, or micro-optimisation.  This type of optimisation should only be applied once you are sure that it will definitely have a benefit, once you have exhausted all obvious higher level macro-optimisations. There have been (too many) times in my coding career when I have spotted some obviously inefficient code and spent too much time joyously optimising it only to end up with rigid, highly optimised, unmaintainable, very clever code that makes absolutely no difference to the overall performance. Or, even worse, it runs slower.

Your first task in optimisation is to figure out what is slow, then why that something is slow (and sometimes when) and then remedy that.

Understanding the sample distribution

If you can cast your minds back to Lesson 1, we had profiled our particle system and had started examining the PC samples obtained from our profiler as depicted below.

I’ve highlighted the two lines with the highest hits and we’ll start by focusing on the second, biggest one first.

    blt 0x00011B30

From the interactive C++ to ASM page from the last lesson we can see that this is a branch command and corresponds to the following C++ code:


To understand why this instruction is such a bottleneck we need to understand a little about how the underlying hardware functions – in particular the CPU pipeline. Modern CPUs can issue a command or two every cycle. These instructions take a variable number of cycles (depending on the instruction) to produce an output, but (and this is important), multiple instructions can be in flow at the same time – each one at a different stage in the pipeline.

For example, consider the following code which does 3 single precision floating point adds:

    fadds f4,f4,f0
    fadds f13,f13,f30
    fadds f0,f31,f29


If each fadds instruction takes 10 cycles, how many cycles does it take to complete these three instructions?


The answer, as with most things when you get deep enough, is “It depends”. In order to simplify this problem let’s assume that the registers that are being used already have the relevant data in them – this way there is no stalling as the fadds instructions wait for the registers to fill from D cache or main memory (we’ll deal with that can of worms in the next lesson).

The following diagram shows some of the details of the PPU pipeline. It’s from this article and is definitely worth reading if you are at all interested in the Cell BE architecture.

Part of the PPU Pipeline

Each box in this diagram corresponds to a cycle – for example, the first 4 boxes, IC1 through IC4 correspond to the time taken for the processor to fetch instructions from the Instruction Cache –  4 cycles –  followed by a two cycle dispatch and then three cycles to decode and three more to issue the instruction to either the Branch Unit, Fixed Point Unit or the Load Store Unit. All instructions have to go through this 12 cycle front end, but if this pipeline is fully populated then a new instruction can be dispatched every cycle. The time taken for an instruction to enter, pass through and then leave a pipeline is called it’s latency. I won’t go into too much depth on the details of CPU pipelines here – check your hardware documentation for more information (for consoles this is generally under NDA anyway) and/or have a read of this Ars Technica article.

Back to the case in hand: The FPU pipeline is 10 cycles long which means it will take 10 cycles for a single instruction to pass through that pipeline (the time taken to get the instruction to the FPU can be ignored for this calculation). A new FPU instruction can be issued every cycle after that; this means that f0 from the third fadds will be ready after cycle 12.

It is worth noting at this point that some instructions (such as divide and square root on the PPU), do not pipeline at all. This means that all the instructions after that command will stall until it has finished, reducing throughput significantly (some non-pipelined instructions take more than 80 cycles).

The more attentive amongst you may have noticed the reuse of the register f0  in the third fadds – this is not an issue as once an instruction is committed into the pipeline the input registers are no longer relevant to that instruction. The compiler will take care of dependencies for you, so you shouldn’t need to worry about it, merely be aware of it.


“So,” I hear you ask, “what does this have to do with the branch statement in the code we’re looking at?”

Well, I’m getting to that. Branching can be an expensive operation and the PPU tries to minimise its impact by trying to predict which way the code will branch. It does this either statically (the compiler or programmer specifies which code path is most likely taken) or dynamically (the PPU keeps a table of branches and which path the most recent branches have taken and uses that to infer the most likely branch).  Check out Mike Acton’s article on branching for more details.

A mispredicted branch will incur a complete pipeline flush which is quite expensive (I’ve not mentioned the exact value for the PPU as I’m not sure if it is covered by NDA but you should be able to infer it from the earlier pipeline diagram) and even an accurately predicted branch has a slight cost.

If we once again look at the performance graph from the first lesson we are now in a position to understand what is happening to produce this peak on the right:

Graph of execution time taken over frame number (from Lesson 1).

The flat part of the graph on the left corresponds to the expansion of the particle ball while all particles are above the collision plane. When the particles start to collide with that plane, the time taken increases dramatically, rising from around 15ms per update to nearly 60ms (when nearly all the particles are at rest on the collision plane). This increase is processing time is due to not just the extra processing when a collision occurs (three fp multiplies and a fp negate), but also due to the branch and (potential) associated pipeline flush.

Removing Branches

Fortunately, there is an instruction that can help us here; fsel (floating point select). This instruction provides a way of effectively performing our own speculative execution and then choosing one of two operands based on a third (see below):

    d = __fsel(a,b,c);
    // is the same as the following code but with no branch instruction
         d = b
        d = c;

The other benefit of this instruction is that it is pipeline friendly – it also takes 10 cycles and our pipeline is left intact once it has completed. And we can have multiple of these instructions in the pipeline at the same time.

So, let’s rewrite our particle code without any classes (to make it easier to see how it maps to asm), pre-emptively calculate the bounced position and velocity and pick the resultant position and velocity via fsel() commands based on the y value of the particle’s position.

void ParticleManager::Update(float fClock)
   const MyVector dampVelocity(0.75f,-0.5f,0.75f);
   const MyVector v3Acceleration=MyVector(0.0f, -9.8f, 0.0f) * fClock; 	//  a*t
   const MyVector at2 = v3Acceleration*fClock*0.5f;                     // ½a*t^2
   MyParticle *p = &m_Particles[0];
   MyParticle *pEnd = &m_Particles[m_NumParticles];
   for (; p<pEnd; p++)
      float newPosy = p->m_Position.y;
      float newPosx = p->m_Position.x;
      float newPosz = p->m_Position.z;
      float newVelx = p->m_Velocity.x;
      float newVely = p->m_Velocity.y;
      float newVelz = p->m_Velocity.z;
      newPosy = newVely * fClock + newPosy + at2.y;
      newPosx = newVelx * fClock + newPosx + at2.x;
      newPosz = newVelz * fClock + newPosz + at2.z;
      newVelx = newVelx + v3Acceleration.x;
      newVely = newVely + v3Acceleration.y;
      newVelz = newVelz + v3Acceleration.z;
      float bouncedPos_y = -1.0f*newPosy;
      float bouncedVelx = newVelx*dampVelocity.x;
      float bouncedVely = newVely*dampVelocity.y;
      float bouncedVelz = newVelz*dampVelocity.z;
      // choose vel based on y pos
      newVelx = __fsels(newPosy,newVelx,bouncedVelx);
      newVely = __fsels(newPosy,newVely,bouncedVely);
      newVelz = __fsels(newPosy,newVelz,bouncedVelz);
      // set y based on y pos
      newPosy = __fsels(newPosy,newPosy,bouncedPos_y);
      p->m_Position.x = newPosx;
      p->m_Position.z = newPosz;
      p->m_Position.y = newPosy;
      p->m_Velocity.x = newVelx;
      p->m_Velocity.y = newVely;
      p->m_Velocity.z = newVelz;

And here is the assembly generated by compiling the above code.

0x000003D0  bge      0x00000458
0x000003D4  lfs      f9,0x4(r4)
0x000003D8  lfs      f10,0xC(r4)
0x000003DC  lfs      f11,0x10(r4)
0x000003E0  lfs      f12,0x14(r4)
0x000003E4  fmadds   f9,f1,f11,f9
0x000003E8  lfs      f13,0x0(r4)
0x000003EC  fadds    f0,f10,f5
0x000003F0  lfs      f31,0x8(r4)
0x000003F4  fadds    f11,f11,f4
0x000003F8  fadds    f30,f12,f5
0x000003FC  fmadds   f10,f1,f10,f13
0x00000400  addic    r5,r4,0x18
0x00000404  fmadds   f12,f1,f12,f31
0x00000408  fmadds   f13,f6,f2,f9
0x0000040C  fnmadds  f9,f6,f2,f9
0x00000410  fmuls    f31,f8,f0
0x00000414  fmuls    f29,f7,f11
0x00000418  fmuls    f28,f8,f30
0x0000041C  fadds    f10,f10,f3
0x00000420  fadds    f12,f12,f3
0x00000424  fsel     f9,f13,f13,f9
0x00000428  fsel     f0,f13,f0,f31
0x0000042C  fsel     f11,f13,f11,f29
0x00000430  fsel     f13,f13,f30,f28
0x00000434  stfs     f10,0x0(r4)
0x00000438  stfs     f12,0x8(r4)
0x0000043C  stfs     f9,0x4(r4)
0x00000440  stfs     f0,0xC(r4)
0x00000444  stfs     f11,0x10(r4)
0x00000448  stfs     f13,0x14(r4)
0x0000044C  clrldi   r4,r5,32
0x00000450  cmplw    r4,r3
0x00000454  blt      0x000003D4

If you have a HTML5 compliant browser, then you can interactively walk through the ASM and C++ code in another window by clicking on the following image:

As you can see from this assembly (particularly via the interactive walkthrough), the only branches are associated with the for loop which iterates through all of the particles.

We would expect this code to run faster as there are no more data accesses and there are less stalls in the worst case scenario (all particles mis-predicting a branch), but how much faster? The following graph shows the performance of the code in lesson 1 against this lesson’s performance.

As you would expect, the performance is constant as there is no branching based on the particle’s position, but what you might not have expected is that the baseline performance is better than the original solution. Even though we are calculating the bounced position and velocity in addition to the non-bounced ones for every particle, this code is still significantly faster than the branching one. Basically, we are doing more work in fewer cycles. The removal of the branch removes the need for a compare (which stalled waiting for the result of the increment of the y position) as well as the branch and, thereby allows better pipelining of the inner loop.

This is a pretty good performance improvement for a fairly small change, but finding wins like this in production code is rare. This type of optimisation really only benefits very tight inner loops; using branchless logic in higher level code will usually have no measurable performance improvement at the cost of obfuscating your code – so choose wisely.

The important things to take away from this lesson are:

  1. Instructions pipeline. Taking advantage of that can give you big performance wins
  2. Branching can be bad on certain architectures (primarily because it trashes the pipeline)

Coming up next

So where to next? Let’s have a look at the PC samples of this new code and see where the bottleneck has moved to:

Samples attributed to C++ code

Most of the samples in this code seem to have coalesced around the calculation of the y position of the particles. The reason for this is very interesting and we’ll address this in the next lesson.

So endeth the second lesson. Please leave a comment if you’ve spotted an error, need something clarified or would just like to leave some feedback.

[Edit] I’ve just been pointed at an excellent article on pipelines and branchless maths: “Down With fcmp: Conditional Moves For Branchless Math”

rtaubert@sunshinedc.com.au (Tony Albrecht) Programming Thu, 10 Nov 2011 15:43:00 +1030
Optimisation Lesson 1: Profiling http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-1-profiling http://overbyte.com.au/index.php/overbyte-blog/entry/optimisation-lesson-1-profiling

The performance of a game is critical – it defines the scope of a game. It specifies how many, how far, how much detail, the level of freedom, even the length of the project. Whether you are an artist, designer, producer or programmer, the performance of a game has an impact on you. As Jeff Atwood recently stated, “Performance is a feature” and as with all features, they need to be scheduled, researched, implemented and maintained.

This article is the first in a series based on the Masterclasses which I’ve recently presented for Game Connection, and will walk you through several phases of optimisation performed on a simple piece of code. It is primarily aimed intermediate programmers  or experienced programmers who haven’t had much to do with performance and low level programming. The aim is to give a detailed example of finding performance bottlenecks and then addressing them. It should give you and idea of how code can be optimised and the magnitude of performance improvements you can expect from different changes. The series of lessons will cover profiling, branch removal, cache aware programming, loop unrolling and SIMD programming.

Introducing A Simple Particle System

The first piece of code we’re going to look at is a very simple particle system. We will look at the update phase only, ignoring the rendering and GPU performance for this example considering only the CPU for now. The CPU in this case is the PlayStation®3′s PPU and external profiling is done with SN System’s excellent Tuner.

There are a couple of simple classes, the vector class  MyVector and the particle class, MyParticle.

class MyVector
    MyVector(float X, float Y, float Z) {x=X;y=Y;z=Z;}
    inline const MyVector& operator+=( const MyVector& rhs );
    float x,y,z;
inline const MyVector &MyVector::operator+=( const MyVector& rhs )
    x += rhs.x; y += rhs.y; z += rhs.z;
    return *this;
inline MyVector operator+( const MyVector& lhs, const MyVector& rhs )
    return MyVector( lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z );
inline MyVector operator*( const MyVector& lhs, const MyVector& rhs )
    return MyVector( lhs.x * rhs.x, lhs.y * rhs.y, lhs.z * rhs.z );
inline MyVector operator*( const MyVector& lhs, float rhs )
    return MyVector( lhs.x * rhs, lhs.y * rhs, lhs.z * rhs);

and the particle class;

struct MyParticle
    MyVector m_Position;
    MyVector m_Velocity;

The particles in this system will accelerate under gravity and will bounce on the y equals zero plane with some damping applied to approximate energy lost in collision. If we can remember back to our high school maths, the equations of motion are;


 {v}' = v + at

{p}' = p + vt + \frac{1}{2} at^{2}


So, the update loop for the code is:

void ParticleManager::Update(float fClock)
    int iParticle;
    const MyVector dampVelocity(0.75f,-0.5f,0.75f);
    const MyVector v3Acceleration=MyVector(0, -9.8, 0) * fClock; 	//  a*dt
    for (iParticle=0; iParticle<m_NumParticles; iParticle++)
        MyParticle &p=m_Particles[iParticle];
        // v' = v + a*dt
        p.m_Velocity += v3Acceleration;
        // Update the particle position
        // p'=p + v * t + 0.5 * a * dt^2
        p.m_Position +=
            + v3Acceleration*fClock*0.5f;
        // if Particle is below the ground plane, then damp and reverse the velocity
        if (p.m_Position.y<0)
            p.m_Position.y*= -1;  // reflect position along y
            p.m_Velocity= p.m_Velocity*dampVelocity;

This code gives us the following output (for 256k particles)

Particle system image

A simple particle system in action.

Step 1: Profile

The first thing we need to do is figure how fast this code is running, and for that we need a useful metric. Frames per second is borderline useless and a percentage of a frame is just as bad. Use a sensible metric like milliseconds or microseconds – then you can then accurately measure the length of a function independent of the frame rate. Also, if possible, capture profiling information over a number of frames. This will allow you to find spikes and patterns in the execution of your code. When optimising code, the spikes are generally what need to be addressed first (and are often the easiest to fix too).

Consider the following graph of this particle system. The vertical axis is milliseconds taken to process all of the particles and the horizontal axis is time (# frames)

Graph of execution time taken over frame number.

Note the shape of the graph – it starts at around 17ms per frame and then dramatically sweeps up to sitting at about 60ms per frame. There is obviously something interesting happening here that requires further examination. Lets run a profiler over this code and see if we can get a better idea of what is happening with this example.

There are two main types of profiling that are generally used; instrumented and sampling.

Instrumented Profiling

Instrumented profiling relies on either manual or automatic insertion of code that stores a time stamp at set points for later analysis. Manual instrumentation relies on the programmer inserting code which wraps the code to be timed, for example, at the start and ends of a function.

void FunctionThatIsTakingALongTime()
   PROFILE_START("Function Name");
   for(int i=0; i<BigNumber;i++)

Manual instrumentation allows a programmer to store and use the timing information in any way that they see fit - on-screen profile bars, graphs or possibly profile data output to file for analysis in a separate application. It is the most versatile method of profiling, but care must be taken that the actual implementation of the profiling stubs aren’t too intrusive, slowing down the execution of the code being profiled. The graph above was produced using manual instrumentation.

Automatic instrumentation is performed by most modern profilers, and will allow runtime insertion of start/stop profiling stubs into code. This method is very useful, allowing a programmer to change which functions to instrument at run time, but relies on the profiling data being collected and displayed by an external application.

Another form of automatic instrumentation is call profiling which relies on the compiler inserting pre- and post-amble profiling code to every function. This method of profiling gives 100% coverage but is very intrusive (it slows down the running of the profiled code significantly).

With both types of instrumented profiling, the more instrumentation that occurs, the more intrusive it is and the slower the code will run. Instrumenting small, high frequency functions can artificially bloat the time taken in those functions and their parents, so care must be taken not to instrument too much.

Example of high frequency instrumentation slowing performance.

In the above diagram, each coloured bar corresponds to a function call, with the top ones calling the ones directly below them and so on, down to the fine grain leaf functions at the bottom. While this visual representation of the code path is useful for understanding the flow of code, it does slow down parts of the code non-uniformly and isn’t necessarily an accurate depiction of the relative performance of the profiled functions.

Program Counter Sampling

This form of profiling requires an external profiler which samples the executing code’s program counter (PC) at set intervals over many frames, collecting these values for later analysis. The Program Counter is a CPU register which stores the current address of execution. Each PC value and the time that it was taken at can be reconstructed to determine which functions were being executed at which time. From this data we can see the overall execution footprint of the executing code – slow code will have more samples attributed to it while the fast code will have less.

PC Sampling Example in SN Tuner

In the above example, we can see that memory allocation is taking a long time and should be investigated further. The annotation has been added manually.

PC sampling, as with instrumentation, can be intrusive and affect the execution speed of the code. The higher the frequency of sampling, the more intrusive it is (although, in this case it has a uniform impact across the entire frame) , so it is advisable to minimise the sampling frequency and collect over many frames instead to get a more indicative performance measurement unless you explicitly need high frequency information for some reason.

Profiling the Particle System

So, back to the problem at hand; we profiled the particle system using PC sampling and the profiler  attributed the samples to the following lines of code:

ParticleManager::Update() samples

The first column is the total number of samples at that PC, while the second column is the number of samples for that frame. Basically, the larger the number attributed to a line, the slower that line is to execute. The problem is that the PC samples correspond to particular assembly instructions and the application is responsible for determining which lines of C++ source those instructions are part of. This is a non-trivial problem (See the interactive example at the end of this lesson), as optimised code will generally move instructions around into an order which provides the best performance, so what we are left with is an approximation of where the slowest lines of code are. At the function granularity these samples are accurate, just not at the level of C++ instructions.

So, in order to accurately see what is taking the most time within a function, we must look at the disassembled instructions for this code fragment. I know what you’re thinking. You’re thinking

“Assembly!? I can’t read that. Only really old guys and freaks can read that!”

But don’t worry, it isn’t necessary to be fluent in the asm of your platform of choice, but it is important that you are able to understand the gist of the disassembled code. It’s like being in a foreign country – you don’t need be conversational (although that is ideal) but it will help immensely of you know some of the basics; the “please”, “thank you” and “Excuse me, where’s the bathroom?” – the phrases you use all the time.

For this lesson, I’ll not attempt to teach any assembly. I’ll leave that as an exercise for the reader  - it should suffice for this PPU based example to know that instructions starting with an ‘l‘ are load instructions, and those starting with ‘s‘ are stores. Instructions starting with an  ’f‘ are floating point operations – the full mnemonic is generally enough to guess the rest. If you want excruciating detail, the PPC books can be found here.

The next image shows the PC samples attributed to the individual assembly instructions. This tells us accurately which instructions are taking the longest to execute.

The numbers on the left are the samples attributed to that line of assembly, and this tells exactly what code we should be looking at. So, how do these instructions relate to the C++ code?  Well, I’ve coded up a simple interactive page which will allow you to step through the C++ of the inner loop and its generated assembly instructions interactively (as long as you have a HTML5 compliant browser that is – otherwise you’ll just get a blank page. In which case you need to go here). The C++ code being executed is highlighted on the right and the corresponding ASM on the left is shown so you can see what the code compiles into. Take note of the fact that there isn’t a direct linear relationship between the C++ and the ASM, the generated ASM can be quite out of order (even in this simple example). The blue box in the bottom right give a description of what is happening for that line.

HTML 5 compliant browser? Click here to step through the code

And so ends this first lesson. For the next lesson we’ll be looking at the hotspots in the generated assembly and figuring out why they are slow and how we can address them.

As always, feedback and questions are encouraged – please leave a comment.


rtaubert@sunshinedc.com.au (Tony Albrecht) Programming Fri, 21 Oct 2011 01:09:00 +1030