Overbyte Blog

Any Port in a Storm


The Overbyte team has been busy beavering away on a number of projects over the last few months, but there is one that I thought might be of interest to a few programmers out there. We recently started work on a PlayStation 3 port of the successful Indie PC title, Vessel, and what we plan to do over the coming weeks is to document the porting process. We'll look at what we started with, what our goals were, what assumptions we made, what mistakes we made along the way and have a look at some really gritty optimisation problems.

The Game

Vessel is a very clever game, written by some clever guys at Strange Loop Games. The key feature that most of the gameplay is built around is a bespoke fluid simulation built on an in-house physics and graphics engine. It had already been optimised to run in less than 16.6ms per physics frame on PC and the performance of this part of the system is critical for fluid gameplay (pun intended). The rendering system uses deferred lighting and doesn't push the GPU too hard - although there are a lot of post process effects, lots of lights and some complicated shaders. The environment is basically 2.5D, with 3 dimensional meshes being used to construct a 2D Platformer with accurate 3D lighting. Most of the gameplay code runs in Lua, with the core systems being written in C++. All the audio was written using FMOD, one of the few third party cross platform systems used in this game.

The code was nicely multi-threaded. The Strange Loop boys had already done an excellent job of extracting the logic into Game, Physics, Loading, and Render threads (to name a few) and updates of data were carefully controlled to avoid race conditions. Also, the rendering system was neatly abstracted to aid cross platform development.


Overbyte inherited the port of Vessel from another studio, so the game was already running on the PlayStation (after some tinkering) and had a solid PS3 rendering engine and enough of the basics in place to allow a performance analysis.

Initial Performance

At this point, we had a game that basically worked and what we needed to determine was how much work was required to optimise this game to the point where it was running fast enough to be playable. A PlayStation 3 is a powerful piece of hardware, but it is not a current generation PC. It doesn't have the memory, caching subsystems, complicated, deeply pipelined, speculative CPUs or anything else that most modern programmers take for granted. This generally means that any program of significant complexity written for a PC will run slower on a PlayStation 3. By a factor of 2 or 3. Or even 4. So, what was the performance of Vessel PS3 like?

The Game thread was pretty slow, taking 30 to 40 ms per frame, with a lot of its time being taken up by Lua and its associated garbage collection. 

The Render thread was pretty good - generally taking less than 10ms and already doing a lot of work on the SPUs.

The Loading wasn't too bad - taking about 26 seconds to load a level. A bit slow, but we could work with that.

The GPU was a little slow. The post process effects were dominating, and in the cases I checked the GPU was mostly running just a little slower than 30fps, but it peaked at around 50ms per frame.

The Physics thread, which was responsible for all of the collisions and the fluid simulation and was a critical part of the gameplay, was taking more than 60ms per frame. Note, that's not 60fps, but 60ms per frame. Ouch. And this critical subsystem needs to run at 16.6ms per update: the other threads can run at 30fps, but the physics needs to run at 60fps to deliver an accurate fluid simulation.


The Physics Thread

Looking closer at the physics execution and code simultaneously made my heart sink and rise – it was heavily STL based, using lots of containers for storage and iteration. As a result of this, there was also a lot of dynamic allocation of memory going on. A lot, as in around 6,000 calls to malloc or new per frame! And this was just in the Physics thread – the Game thread was also allocating memory (thanks Lua), and when both threads were allocating at the same time, both would slow down dramatically as the multi-threaded memory manager started hitting the OS and using synchronisation primitives to ensure thread safety, thereby blocking one thread or the other.

Implicit allocations in STL containers occur whenever a container grows beyond its capacity and needs to be reallocated. This results in the allocation of a new, larger chunk of memory, the copying (or moving) of the existing contents into it, and the destruction and deallocation of the old data. Generally this isn’t a problem, but in a system where performance is critical, thousands of allocations per frame are a needless waste.
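To make the cost visible, here is a minimal sketch (not from the Vessel code base; CountReallocations is a made-up helper) that counts how often a std::vector reallocates while it is being filled. The exact count without a reserve depends on the growth policy of the STL implementation in use, so treat it as something to measure rather than a fixed number.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times a std::vector<int> reallocates while `count`
// elements are appended. A change in capacity() means a new block was
// allocated and the old contents were moved across.
std::size_t CountReallocations(std::size_t count, std::size_t reserved)
{
    std::vector<int> v;
    v.reserve(reserved);  // reserve(0) is effectively a no-op
    assert(v.capacity() >= reserved);

    std::size_t reallocations = 0;
    std::size_t lastCapacity = v.capacity();
    for (std::size_t i = 0; i < count; ++i)
    {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != lastCapacity)
        {
            ++reallocations;
            lastCapacity = v.capacity();
        }
    }
    return reallocations;
}
```

Calling CountReallocations(1000, 0) reports at least one reallocation, while CountReallocations(1000, 1000) reports none, because the capacity was paid for up front.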

So, while we knew that the STL allocations were bad, they were a bad that we understood and could hopefully remedy.

There were also other areas of code that contained simple functions which were called with a very high frequency – these types of functions are ideal for optimisation. You can unroll them, massage their data into cache friendly formats, make them SIMD, and possibly port them to SPU and run them in parallel.
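As a rough illustration of what "cache friendly" can mean here, the sketch below stores per-drop data as a structure of arrays and updates it in one tight loop. The DropsSoA layout and the Integrate function are assumptions made up for this example, not the Vessel data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays layout: each field is contiguous, so a
// loop that only touches positions and velocities walks memory linearly.
struct DropsSoA
{
    std::vector<float> posX, posY;
    std::vector<float> velX, velY;
};

// Advances every drop by dt. The body is simple and branch free over
// contiguous floats, which is the kind of loop that is a good candidate
// for unrolling and SIMD.
void Integrate(DropsSoA& drops, float dt)
{
    const std::size_t n = drops.posX.size();
    assert(drops.posY.size() == n);
    assert(drops.velX.size() == n && drops.velY.size() == n);
    for (std::size_t i = 0; i < n; ++i)
    {
        drops.posX[i] += drops.velX[i] * dt;
        drops.posY[i] += drops.velY[i] * dt;
    }
}
```

Whether a layout change like this pays off depends on how the rest of the code accesses the data, so it is worth profiling before and after.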

So, looking at the performance of the game as a whole, the physics system was the biggest issue. It would have to be sped up by a factor of 4 (at least) and would most likely involve the porting of entire subsystems to SPU.

A Taste of what’s to come

Here’s an example of what we’ll be covering; in the previous section I mentioned that implicit STL allocations were bad – look at the following code.



This code iterates through an arbitrary number of springs (which contain the drops that they connect) and adds the spring IDs to vectors specified by the drops. The daSpringsFromDrop passed in from the calling function is declared on the stack and is only transitory. This function allocates memory during the resize() and whenever a spring is added to a drop array causing it to grow (which is pretty much every time). 
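The pattern just described can be sketched roughly as follows. This is an illustration built from the description only, not the Vessel source: the Spring struct, its field names, the function signature and the CountAttachmentsTransient caller are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical spring record: the two drops it connects.
struct Spring
{
    int dropA;
    int dropB;
};

// For every drop, collect the IDs of the springs attached to it.
// The resize() creates one inner vector per drop, and every push_back()
// may allocate when an inner vector outgrows its capacity.
void GetSpringsFromDrops(const std::vector<Spring>& springs,
                         std::size_t dropCount,
                         std::vector<std::vector<int>>& daSpringsFromDrop)
{
    daSpringsFromDrop.resize(dropCount);
    for (std::size_t iSpring = 0; iSpring < springs.size(); iSpring++)
    {
        const int iDropA = springs[iSpring].dropA;
        const int iDropB = springs[iSpring].dropB;
        assert(iDropA >= 0 && static_cast<std::size_t>(iDropA) < dropCount);
        assert(iDropB >= 0 && static_cast<std::size_t>(iDropB) < dropCount);
        daSpringsFromDrop[iDropA].push_back(static_cast<int>(iSpring));
        daSpringsFromDrop[iDropB].push_back(static_cast<int>(iSpring));
    }
}

// Caller in the style described: the outer vector lives on the stack, so
// every call pays to build it and to destroy it (and all of the inner
// vectors) when it goes out of scope.
std::size_t CountAttachmentsTransient(const std::vector<Spring>& springs,
                                      std::size_t dropCount)
{
    std::vector<std::vector<int>> daSpringsFromDrop;
    GetSpringsFromDrops(springs, dropCount, daSpringsFromDrop);
    std::size_t total = 0;
    for (const std::vector<int>& ids : daSpringsFromDrop)
        total += ids.size();
    return total;
}
```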

If we look at the span of inputs for this function, we see that there can be more than 1,000 springs and 3,000 drops. With these numbers we are looking at potentially thousands of memory allocations in this function alone, and on PS3, this function can peak at 10ms per call (including the destruction of the daSpringsFromDrop vector).

So, what’s the best way to optimise this function? Quite simply, don’t allocate memory if you don’t need to. All we did for this was globally declare the daSpringsFromDrop vector<> and only let it grow, never deleting it. We need to be careful that we clear all the entries of the outer vector, clear()ing the inner vectors before we call GetSpringsFromDrops(), but the overhead of this is minimal compared to dynamically constructing and deleting the vectors (not to mention reallocating during calls to GetSpringsFromDrops()). This optimisation resulted in a speed up of over 9ms, leaving this function running at around 1ms per call. Much better.
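A minimal sketch of that approach, under stated assumptions: the Spring struct, the g_daSpringsFromDrop name and the ResetSpringsFromDrops helper are invented for illustration, and the code assumes that only one thread (the Physics thread) ever touches the global.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical spring record: the two drops it connects.
struct Spring
{
    int dropA;
    int dropB;
};

// Long-lived storage: it grows to its high-water mark and is never freed.
static std::vector<std::vector<int>> g_daSpringsFromDrop;

// Empties every inner vector while keeping its capacity, then makes sure
// there is at least one inner vector per drop. The outer vector only grows.
void ResetSpringsFromDrops(std::size_t dropCount)
{
    for (std::vector<int>& ids : g_daSpringsFromDrop)
        ids.clear();  // size becomes 0, capacity is left unchanged
    if (g_daSpringsFromDrop.size() < dropCount)
        g_daSpringsFromDrop.resize(dropCount);
}

// Same work as before, but the push_back()s reuse memory from earlier
// frames once the inner vectors have grown to their working size.
void GetSpringsFromDrops(const std::vector<Spring>& springs,
                         std::size_t dropCount)
{
    ResetSpringsFromDrops(dropCount);
    for (std::size_t iSpring = 0; iSpring < springs.size(); iSpring++)
    {
        const int iDropA = springs[iSpring].dropA;
        const int iDropB = springs[iSpring].dropB;
        assert(iDropA >= 0 && static_cast<std::size_t>(iDropA) < dropCount);
        assert(iDropB >= 0 && static_cast<std::size_t>(iDropB) < dropCount);
        g_daSpringsFromDrop[iDropA].push_back(static_cast<int>(iSpring));
        g_daSpringsFromDrop[iDropB].push_back(static_cast<int>(iSpring));
    }
}
```

One consequence of never shrinking the outer vector is that it can hold more entries than there are drops in the current frame, so consumers should only read the first dropCount entries. The trade-off is memory held at the high-water mark in exchange for avoiding per-frame allocations.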

It’s worth noting that there is nothing wrong with the original code on the original platform. It performs well with the default Windows memory system and STL implementation so there was no need to optimise it there. Things are different on consoles though.

Over the coming weeks, we will be posting regular updates on how we have optimised different parts of the Vessel code base. We’ve already put in a couple of months’ work, so we have a backlog of content to deliver – not just optimisations though. We’ll also be talking about some of the basic issues that a team has to deal with when porting to a console – TRC, memory limitations, asset optimisation, and constraints on third party tools. But, primarily, we’ll be looking at the optimisations required to speed this game up by at least 40ms per frame.

I hope you’ll join us on this journey. It should at least be interesting.


I've been a professional game developer since 2000, specialising in the hard core, low level, highly technical programming that is required to produce games that keep getting bigger and better. I love writing well specified, high performance code and rebuilding existing systems to function at the highest levels of performance. I take pride in understanding how the hardware works at the lowest levels so that I can eke out the best performance at the higher levels.


  • Guest
    Martin V Tuesday, 17 September 2013

    nice .. I'll keep an eye on this, interesting challenge!

  • Wes Robb
    Wes Robb Tuesday, 17 September 2013

    Great write-up. I'm always interested to see software written by someone with a deep understanding of the hardware.

  • Guest
    DVT Wednesday, 18 September 2013

    Very interesting!

  • Ethan
    Ethan Thursday, 10 October 2013

    Interesting read!

    Here are some random thoughts I had while looking at the code/article....
    - Do you need to be calling the springs.size() inside the loop?
    - Declaring/initializing iSpring before the loop.
    - Do you need to have iSpring++ instead of ++iSpring?
    - Do the IDrops need to be const ints and declared inside the loop? I suppose for other reasons that are hard to tell from this portion of code they do.

  • Tony Albrecht
    Tony Albrecht Thursday, 10 October 2013

    Thanks Ethan.
    - the .size() in the loop makes very little difference to performance. It's worth leaving it there purely for clarity.
    - declaring iSpring outside the loop will make no difference to performance at all.
    - preincrementing or postincrementing still results in a single addition. No difference to performance.
    - The iDrops are declared as consts purely for the convenience of the reader. They are declared on the stack and moving those declarations outside will again, make no difference.

    The overbearing bottleneck in this code is the memory allocation implicit in the push_back()s. The amount of code in there is vast compared to this tiny loop. You should be careful about obfuscating your code while hoping to improve performance without actually making any perceptible difference.
