Optimisation Lesson 3: The Memory Bottleneck
Welcome to Lesson 3 of the Overbyte Optimisation series.
If you cast your mind back to the end of lesson 2, we were left with the following code and corresponding samples:
(Note that the samples here are not exactly the same as the ones in the previous lesson, even though the code is. This is due to a change in the version of the compiler being used – we now have faster, more interesting code)
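The original listing was published as an image and isn't reproduced here, so as a stand-in, here is a minimal sketch consistent with the details given later in the article (MyParticle is 24 bytes; the field layout and function name are assumptions):

```c
#include <stddef.h>

/* Hypothetical reconstruction -- the article's actual listing is an image.
   24 bytes, matching the size quoted later in the article. */
typedef struct {
    float px, py, pz;   /* position */
    float vx, vy, vz;   /* velocity */
} MyParticle;

/* Simple Euler step: accelerate, then move. The loads of py, vx and vy
   in a loop like this are the ones the samples point at. */
void UpdateParticles(MyParticle *p, size_t count,
                     float ax, float ay, float az, float dt)
{
    for (size_t i = 0; i < count; ++i) {
        p[i].vx += ax * dt;
        p[i].vy += ay * dt;
        p[i].vz += az * dt;
        p[i].px += p[i].vx * dt;
        p[i].py += p[i].vy * dt;
        p[i].pz += p[i].vz * dt;
    }
}
```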
From looking at the source, there is nothing obviously expensive in the calculations at the highlighted points – no branches, square roots, divides or other non-pipelined instructions. So, there must be something other than instruction latency causing the performance problem.
Let's have a look at the disassembled code and see if that sheds any light:
I've highlighted the two biggest samples – almost an order of magnitude larger than the others (note that the order of the two hits is swapped relative to the source: a good example of the compiler reordering instructions into a more optimal sequence). From the last lesson, we know that fmadds has a latency of only around 10 cycles, so instruction latency alone can't account for this number of hits. Let's look at the C code in relation to the generated assembly:
In the optimised code, the y position is loaded first, followed by the x and y velocities; the x velocity is then added to the already loaded x component of the acceleration value. It's this addition that seems to be taking most of the time (we'll look at the reason for the second stall later in this lesson). What is actually happening is that the fadds instruction is stalling, waiting for the x velocity to arrive in the f10 register: the addition can't take place until all of its source registers hold the correct data. The large number of hits on this instruction implies that this memory load is taking a very long time indeed.
So what’s involved in loading memory into the CPU registers?
In order to load data into a register on the CPU, it must be transferred from the memory subsystem. Back in the good old days, memory access times were comparable to instruction times – each on the order of a few cycles. Over the following decades, improvements in CPU speed followed Moore's law, increasing performance exponentially, while memory access speeds crawled along, improving by only around 10% per year.
In order to reduce this memory bottleneck, CPU designers added a small amount of very fast, very expensive memory called a cache. Caches come in different flavours: the Level 1 (L1) cache is the smallest and fastest and is generally found on-chip. It can only pull data from the Level 2 (L2) cache, which is larger and slower, while the Level 3 cache (if your CPU has one) is larger and slower again. The highest order cache in your system is the one that retrieves data from main memory. Each element in a cache contains not only a copy of some data from a location in main memory, but also information on where that data came from. Each cache can only access data from the level below it (or main memory, if it is the highest order one). A great explanation of how the different types of caches work can be found here.
In the case we are looking at, the instruction requests some data to be loaded into a floating point register. The CPU first queries the L1 cache for that data and, if it is already resident there, returns it within a few cycles. If it is not there, the L2 cache is queried; a hit there will be returned as soon as possible, which in this case (on the PS3) is around 30 cycles. If, however, the required data isn't in the L2 cache either, then main memory must be accessed, which takes hundreds of cycles (somewhere between 400 and 600 on the PS3 and X360). This is an eternity as far as the CPU is concerned – hundreds of instructions could be executed in the time it takes to pull data from main memory.
If you had a single instruction that could take 100s of cycles, how often would you use it?
The following interactive page illustrates the relative times for accessing data from L1, L2 and main memory. Click on the different memory subsystems to see how slow main memory is. (Page requires HTML5)
As you've experienced above, retrieving data from main memory is very slow – painfully slow. (In this demo, the latencies for the different gaps are: L1$ -> CPU: 4 cycles, L2$ -> L1$: 30 cycles, Main Mem -> L2$: 600 cycles.) An L2 cache miss is on the order of 50 times slower than the average instruction latency.
Minimising the memory bottleneck
So, what can we do about it? We need to use memory, so how do we minimise the impact of these long reads from main memory?
The simplest way is to just use less memory – minimise the size of your regularly used (hot) structures. Do you really need an array of rarely used data (e.g. char name;) embedded in the struct, or can you just maintain a pointer to it? If it is a large structure, can you break it up into hot and cold parts, or at least into parts that are used in sequence?
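As an illustration of that hot/cold split (the struct and field names here are invented for the example, not taken from the article's code):

```c
#include <stddef.h>

/* Before: a rarely read name buffer rides along on every cache line the
   update loop touches. */
typedef struct {
    float px, py, pz;
    float vx, vy, vz;
    char  name[64];            /* cold: only read by debug/UI code */
} FatParticle;                 /* 88 bytes -- barely one per 128-byte line */

/* After: the hot data stays dense... */
typedef struct {
    float px, py, pz;
    float vx, vy, vz;
} HotParticle;                 /* 24 bytes -- five per 128-byte line */

/* ...and the cold data lives in a parallel array, found by the same index
   (or reached via a pointer, as the article suggests). */
typedef struct {
    char name[64];
} ColdParticleInfo;
```

The hot update loop now streams over 24-byte records only, so every fetched cache line is full of data it actually uses.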
In the particle system case we have here, these aren't really options (unless you want to drop down to 16-bit fixed point arithmetic). Another alternative is to make sure that the data you are accessing is contiguous: each cache load reads in an entire cache line (128 bytes on the X360/PS3), so you should make sure that all of the data being read is actually used. Traversing an arbitrarily allocated linked list is a pathologically bad example of this, where each read of the next pointer is most likely a cache miss. Traditional scene trees are another example of bad cache behaviour, but there are ways around both of these cases (I won't cover them in this article, but you can see an example of a scene tree optimisation here in an article I wrote a couple of years ago). In the case we're dealing with here, this won't help either, as all of our data is already in nice flat arrays and nothing is read that we don't use.
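To make the linked-list point concrete, here is a sketch contrasting the two traversals (the names are invented for the example):

```c
#include <stddef.h>

/* Pathological case: each node was allocated separately, so following
   `next` is likely a jump to an unrelated cache line -- potentially one
   miss per particle, and the address of the next node isn't known until
   the current one has arrived. */
typedef struct PNode {
    float px, py, pz, vx, vy, vz;
    struct PNode *next;
} PNode;

float SumXLinked(const PNode *n)
{
    float s = 0.0f;
    for (; n != NULL; n = n->next)   /* pointer chase */
        s += n->px;
    return s;
}

/* Cache-friendly case: one flat allocation read sequentially. Every byte
   of each fetched 128-byte line gets used, and the fixed stride makes
   prefetching (manual or hardware) trivial. */
float SumXFlat(const float *px, size_t count)
{
    float s = 0.0f;
    for (size_t i = 0; i < count; ++i)
        s += px[i];
    return s;
}
```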
The third optimisation is to use a special instruction that will start fetching some data for us, ideally well before we need it so that when we do need it, it will hopefully be in the L2 cache ready for us. This is called prefetching and, when used effectively, can have a dramatic effect on your code’s performance.
On the PlayStation 3, the prefetch instruction is __dcbt(loc), where loc is the location of the data to be prefetched. It triggers the load of a single cache line (128 bytes) into the L2 cache. Let's add one into our loop and see what happens:
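The modified loop itself was shown as an image; the sketch below captures the idea. It uses GCC/Clang's portable __builtin_prefetch as a stand-in for the PS3's __dcbt, and the struct layout is an assumption – only the 10-particle look-ahead comes from the article:

```c
#include <stddef.h>

typedef struct {
    float px, py, pz;
    float vx, vy, vz;
} MyParticle;                    /* 24 bytes */

void UpdateParticles(MyParticle *p, size_t count, float ay, float dt)
{
    for (size_t i = 0; i < count; ++i) {
        /* Start pulling the particle 10 iterations ahead into cache now;
           at ~40 cycles per iteration it should have arrived from main
           memory (~400 cycles) by the time we reach it. On PS3 this line
           would be __dcbt(&p[i + 10]). */
        if (i + 10 < count)      /* __dcbt tolerates any address; the
                                    portable builtin is clamped instead */
            __builtin_prefetch(&p[i + 10]);

        p[i].vy += ay * dt;              /* gravity on y */
        p[i].px += p[i].vx * dt;
        p[i].py += p[i].vy * dt;
        p[i].pz += p[i].vz * dt;
    }
}
```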
Well, our stalls seem to have vanished, but what about the performance? The original code was running at around 10.8ms – this code now executes in 4.5ms! Nice! We've sped up our code by more than a factor of two just by adding a single instruction.
I chose (p+10) as the location to prefetch as the inner loop is about 40 cycles and fetching 10 particles ahead corresponds to 10*40 = 400 cycles which is approximately the time taken to fetch data from main memory. The graph below shows the different times taken for different prefetch distances:
The first column corresponds to the code without the prefetch instruction, while the second column prefetches the particle that is about to be used in that very iteration – note that this shows the cost of the prefetch instruction is non-zero, but the benefit of prefetching very quickly outweighs that cost. At a look-ahead of 8 to 10 particles (192 to 240 bytes) we start to plateau. It is worth noting that __dcbt() loads in 128 bytes at once, so by calling __dcbt() on every iteration of our loop, we're actually prefetching each cache line around 5 times. This is still cheaper than testing whether the next particle to be prefetched is in the same cache line or not. (If we unrolled the loop to process 5 particles at once (5 * sizeof(MyParticle) = 120), we'd get a better prefetch distribution as well as better pipeline utilisation, but we'll look at loop unrolling in a later lesson.)
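A rough sketch of what that 5-particle unroll might look like (hypothetical code – the article defers loop unrolling to a later lesson, and __builtin_prefetch again stands in for __dcbt):

```c
#include <stddef.h>

typedef struct {
    float px, py, pz;
    float vx, vy, vz;
} MyParticle;                    /* 24 bytes; 5 particles = 120 bytes */

void UpdateParticles5(MyParticle *p, size_t count, float dt)
{
    size_t i = 0;
    /* One prefetch per group of 5: the group spans roughly one 128-byte
       cache line, so each line is now touched about once instead of 5
       times. The look-ahead is still ~10 particles. */
    for (; i + 5 <= count; i += 5) {
        if (i + 10 < count)
            __builtin_prefetch(&p[i + 10]);
        for (size_t j = 0; j < 5; ++j) {     /* fixed trip count: easy for
                                                the compiler to unroll */
            p[i + j].px += p[i + j].vx * dt;
            p[i + j].py += p[i + j].vy * dt;
            p[i + j].pz += p[i + j].vz * dt;
        }
    }
    for (; i < count; ++i) {                 /* remainder particles */
        p[i].px += p[i].vx * dt;
        p[i].py += p[i].vy * dt;
        p[i].pz += p[i].vz * dt;
    }
}
```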
The reason for the dual stalls
The reason for the dual stalls in the original code comes down to the size of MyParticle relative to the size of a cache line. When a single particle is loaded (MyParticle is 24 bytes in size), an entire 128-byte cache line is fetched. This means that another four particles are loaded in with it, as well as part of a sixth. The following diagram shows the distribution of particles over 3 consecutive cache lines (the pattern repeats every 3 cache lines, i.e. every 16 particles, since 16 × 24 = 384 = 3 × 128).
The first case, where the struct aligns exactly with a cache line boundary, results in a hit on the first stall as the x velocity is used. The next four particles are all within that cache line and so do not incur another cache miss penalty. The next miss happens when loading the sixth particle and contributes to the first stall once again – py is in the cache, and vx is the first miss. The miss after that contributes to the second stall: vx is already in the cache and the processor has to wait while vy is fetched from main memory. This behaviour matches our observations, with approximately 2/3 of the hits on the first stall and 1/3 on the second. When we added prefetching, both of the stalls were removed (as we would expect).
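That 2/3 : 1/3 split can be checked with a few lines of arithmetic. The field offsets below (py at byte 4, vx at 12, vy at 16) are assumptions about the struct layout, consistent with the load order described above:

```c
#include <stddef.h>

#define LINE_SIZE 128
#define PART_SIZE 24     /* sizeof(MyParticle) */
#define OFF_PY     4     /* assumed layout: px, py, pz, vx, vy, vz */
#define OFF_VX    12
#define OFF_VY    16

static int LineOf(size_t n, size_t off)
{
    return (int)((n * PART_SIZE + off) / LINE_SIZE);
}

/* Walk one repeating group (16 particles = 384 bytes = 3 cache lines) in
   load order (py, vx, vy) and count which load is the first to touch a
   line that isn't cached yet. A miss on py or vx surfaces at the first
   stall (the fadds waiting on vx); a miss on vy surfaces at the second. */
void CountStalls(int *first, int *second)
{
    int cached = -1;     /* highest line already pulled into cache */
    *first = *second = 0;
    for (size_t n = 0; n < 16; ++n) {
        int lpy = LineOf(n, OFF_PY);
        int lvx = LineOf(n, OFF_VX);
        int lvy = LineOf(n, OFF_VY);
        if (lpy > cached || lvx > cached) {
            ++*first;
            cached = (lvx > lpy) ? lvx : lpy;
        }
        if (lvy > cached) {
            ++*second;
            cached = lvy;
        }
    }
}
```

Over each group of three cache lines this tallies two misses feeding the first stall and one feeding the second, matching the sampled distribution.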
What we've seen here is how to identify and fix a simple system which is performing sub-optimally due to cache misses. The memory used is already contiguous and homogeneous, so prefetch look-aheads are easy to implement. More complex memory access patterns can be considerably more difficult to optimise, so I would suggest that one of your first passes at optimising a system (after algorithmic optimisation) is to simplify your memory access. Determining what data is used in a tight loop and then ensuring that data is stored in a cache friendly fashion will not only give you an instant speed boost, but will make it easier to prefetch and thereby optimise further.
When designing a system that needs to be fast, it is critical that you consider the layout of the memory being used. Optimal code is irrelevant if your data is badly organised.
TLDNR: put all your data in flat arrays.
Coming up next
Now that we’ve removed the memory bottleneck, we’ll next have a look at further optimising this code by addressing the instructions used. In particular, we’ll start using some SIMD instructions and see how that impacts our performance.
Until then, remember, it’s all about the memory.