How Mechanical Sympathy got me to the airport on time

Lets talk about mechanical sympathy.  Martin Thompson has been making this term very popular in software development, so it’s best to read his description of why he used the term.
I had a perfect example yesterday.  I’m about to drive to the airport and the car won’t move (I’m a modern tomboy, I can write stories about hair and stories about cars). Not at all.  It’s stuck.  I can’t reverse out of my parking space.
The first thing that occurs to me is something is stuck under the car.  Like a cat.  Those buggers hide in the stupidest of places.  So I get out and check there’s nothing wedged against the wheels, which seems like the most logical thing that would stop the car moving.
(OK that’s not true.  The first thing that occurs to me is oh-my-god-I’m-going-to-miss-my-plane-and-I-haven’t-got-a-backup-plan-to-get-to-the-airport-and-I’ve-already-paid-for-parking-and-I-would-like-to-cry-now-but-that’s-not-going-to-help)
Since there’s nothing under the car, the next thing that occurs to me is the handbrake is stuck on - it feels like it’s the rear left wheel that won’t move, and I know (probably as I had a similar problem on my previous car, my lovely but ancient MX5, or maybe because I’ve seen far too much Top Gear) that the handbrake applies to the rear wheels.
I also suspect that the handbrake jamming this is the correct answer because I put the car through the car wash on Monday, and the known issue of the brakes jamming on some Volkswagens always manifests in my car after the car wash.  I should probably just let it get dirty.
Because I’m big on tests, and scientific theory, I test this hypothesis.  I rev the hell out of the car and force it backwards.  OK, I’ll be honest - this wasn’t a test.  This was a brute force attempt to unjam the handbrake. But it did provide a suitable test - when I got out of the car to inspect the offending wheel, I could see skid marks on the floor where the wheel had been dragged, rather than rolled, over the tarmac.  Oops.
So my (limited) knowledge of how a car works has provided me with a hypothesis, and reasonable proof that it’s correct.
That’s all fine, but I’m still not on the way to the airport yet.  How did I fix it?
What do you think I did? Well, I drove it back and forth a view times, hoping the jolting and extreme discomfort of the brake pad against a moving wheel would unlock the brake (from previous experience, the brake clicks off, rather than simply easing off).
Of course it didn’t work, and I’m probably going to have to replace that tyre much sooner than expected.  Ooops x2.
I sat in the car and pumped the foot brake millions of times - a trick that either inexplicably worked on the MX5, or simply distracted me long enough and the brakes fixed themselves eventually.
This didn’t work, not surprisingly (not the same brakes after all).
Finally, I got out of the car and kicked the wheel a few times.  Hard.  I figured this might jolt the brake back into position.
It didn’t
Then I used my brain.  That’s not going to work when you unconsciously put the hand brake on when you leave the car to kick the wheel.
So I took the brake off, then kicked the wheel.  Hard.  A lot.
And it worked!!!
And this is a bit like diagnosing performance problems (no, really).  My basic understanding of how the car was put together, plus some experience with similar problems in the past, gave me guidance on where the issue was and what might be done to fix it.
So, what have we learnt today, boys and girls?
  • A basic understanding of how the hardware works can prevent you calling an expensive expert to do a simple fix.
  • Always have a backup plan (Disaster Recovery/High Availability - if not hardware, then at least some sort of process).  I was much more stressed because without the car, I had to come up with a whole new transportation plan with a very limited time budget and having no knowledge of or experience with alternatives.
  • I need to drive my car more, because then a) I would have either discovered the problem sooner, or b) it would never have got to that point.  I’m going to say that in this tenuous analogy, the equivalent is to do testing in a live-like environment, and actually do DR scenario testing.  Then you know how your hardware and software behaves under abnormal circumstances, and have some concept of how to get back to normal.

I did make it to the airport on time, because adding padding to your estimates is a Good Thing and gives you a bit of space for contingencies.  So this story does have a happy ending.
Here endeth the lesson.

Dissecting the Disruptor: Demystifying Memory Barriers

My recent slow-down in posting is because I’ve been trying to write a post explaining memory barriers and their applicability in the Disruptor. The problem is, no matter how much I read and no matter how many times I ask the ever-patient Martin and Mike questions trying to clarify some point, I just don’t intuitively grasp the subject. I guess I don’t have the deep background knowledge required to fully understand.

So, rather than make an idiot of myself trying to explain something I don’t really get, I’m going to try and cover, at an abstract / massive-simplification level, what I do understand in the area.  Martin has written a post going into memory barriers in some detail, so hopefully I can get away with skimming the subject.

Disclaimer: any errors in the explanation are completely my own, and no reflection on the implementation of the Disruptor or on the LMAX guys who actually do know about this stuff.

What’s the point?
My main aim in this series of blog posts is to explain how the Disruptor works and, to a slightly lesser extent, why. In theory I should be able to provide a bridge between the code and the technical paper by talking about it from the point of view of a developer who might want to use it.

The paper mentioned memory barriers, and I wanted to understand what they were, and how they apply.

What’s a Memory Barrier?
It’s a CPU instruction.  Yes, once again, we’re thinking about CPU-level stuff in order to get the performance we need (Martin’s famous Mechanical Sympathy).  Basically it’s an instruction to a) ensure the order in which certain operations are executed and b) influence visibility of some data (which might be the result of executing some instruction).

Compilers and CPUs can re-order instructions, provided the end result is the same, to try and optimise performance.  Inserting a memory barrier tells the CPU and the compiler that what happened before that command needs to stay before that command, and what happens after needs to stay after.  All similarities to a trip to Vegas are entirely in your own mind.

The other thing a memory barrier does is force an update of the various CPU caches - for example, a write barrier will flush all the data that was written before the barrier out to cache, therefore any other thread that tries to read that data will get the most up-to-date version regardless of which core or which socket it might be executing by.

What’s this got to do with Java?
Now I know what you’re thinking - this isn’t assembler.  It’s Java.

The magic incantation here is the word volatile (something I felt was never clearly explained in the Java certification).  If your field is volatile, the Java Memory Model inserts a write barrier instruction after you write to it, and a read barrier instruction before you read from it.

This means if you write to a volatile field, you know that:

  1. Any thread accessing that field after the point at which you wrote to it will get the updated value 
  2. Anything you did before you wrote that field is guaranteed to have happened and any updated data values will also be visible, because the memory barrier flushed all earlier writes to the cache.
Example please!
So glad you asked.  It’s about time I started drawing doughnuts again.
The RingBuffer cursor is one of these magic volatile thingies, and it’s one of the reasons we can get away with implementing the Disruptor without locking.
The Producer will obtain the next Entry (or batch of them) and do whatever it needs to do to the entries, updating them with whatever values it wants to place in there.  As you know, at the end of all the changes the producer calls the commit method on the ring buffer, which updates the sequence number.  This write of the volatile field (cursor) creates a memory barrier which ultimately brings all the caches up to date (or at least invalidates them accordingly).  
At this point, the consumers can get the updated sequence number (8), and because the memory barrier also guarantees the ordering of the instructions that happened before then, the consumers can be confident that all changes the producer did to to the Entry at position 7 are also available.
…and on the Consumer side?
The sequence number on the Consumer is volatile, and read by a number of external objects - other downstream consumers might be tracking this consumer, and the ProducerBarrier/RingBuffer (depending on whether you’re looking at older or newer code) tracks it to make sure the the ring doesn’t wrap.

So, if your downstream consumer (C2) sees that an earlier consumer (C1) reaches number 12, when C2 reads entries up to 12 from the ring buffer it will get all updates C1 made to the entries before it updated its sequence number.

Basically everything that happens after C2 gets the updated sequence number (shown in blue above) must occur after everything C1 did to the ring buffer before updating its sequence number (shown in black).

Impact on performance
Memory barriers, being another CPU-level instruction, don’t have the same cost as locks  - the kernel isn’t interfering and arbitrating between multiple threads.  But nothing comes for free.  Memory barriers do have a cost - the compiler/CPU cannot re-order instructions, which could potentially lead to not using the CPU as efficiently as possible, and refreshing the caches obviously has a performance impact.  So don’t think that using volatile instead of locking will get you away scot free.

You’ll notice that the Disruptor implementation tries to read from and write to the sequence number as infrequently as possible.  Every read or write of a volatile field is a relatively costly operation. However, recognising this also plays in quite nicely with batching behaviour - if you know you shouldn’t read from or write to the sequences too frequently, it makes sense to grab a whole batch of Entries and process them before updating the sequence number, both on the Producer and Consumer side. Here’s an example from BatchConsumer:

    long nextSequence = sequence + 1
    while (running)
            final long availableSequence = consumerBarrier.waitFor(nextSequence
            while (nextSequence <= availableSequence)
                entry = consumerBarrier.getEntry(nextSequence
            sequence = entry.getSequence
        catch (final Exception ex)
            exceptionHandler.handle(ex, entry
            sequence = entry.getSequence
            nextSequence = entry.getSequence() + 1

(You’ll note this is the “old” code and naming conventions, because this is inline with my previous blog posts, I thought it was slightly less confusing than switching straight to the new conventions).

In the code above, we use a local variable to increment during our loop over the entries the consumer is processing.  This means we read from and write to the volatile sequence field (shown in bold) as infrequently as we can get away with.

In Summary
Memory barriers are CPU instructions that allow you to make certain assumptions about when data will be visible to other processes.  In Java, you implement them with the volatile keyword.  Using volatile means you don’t necessarily have to add locks willy nilly, and will give you performance improvements over using them.  However you need to think a little more carefully about your design, in particular how frequently you use volatile fields, and how frequently you read and write them.

PS Given that the New World Order in the Disruptor uses totally different naming conventions now to everything I’ve blogged about so far, I guess the next post is mapping the old world to the new one.

Dissecting the Disruptor: Why it's so fast (part two) – Magic cache line padding

We mention the phrase Mechanical Sympathy quite a lot, in fact it’s even Martin’s blog title.  It’s about understanding how the underlying hardware operates and programming in a way that works with that, not against it.

We get a number of comments and questions about the mysterious cache line padding in the RingBuffer, and I referred to it in the last post.  Since this lends itself to pretty pictures, it’s the next thing I thought I would tackle.

Comp Sci 101
One of the things I love about working at LMAX is all that stuff I learnt at university and in my A Level Computing actually means something.  So often as a developer you can get away with not understanding the CPU, data structures or Big O notation - I spent 10 years of my career forgetting all that.  But it turns out that if you do know about these things, and you apply that knowledge, you can come up with some very clever, very fast code.

So, a refresher for those of us who studied this at school, and an intro for those who didn’t.  Beware - this post contains massive over-simplifications.

The CPU is the heart of your machine and the thing that ultimately has to do all the operations, executing your program.  Main memory (RAM) is where your data (including the lines of your program) lives.  We’re going to ignore stuff like hard drives and networks here because the Disruptor is aimed at running as much as possible in memory.

The CPU has several layers of cache between it and main memory, because even accessing main memory is too slow.  If you’re doing the same operation on a piece of data multiple times, it makes sense to load this into a place very close to the CPU when it’s performing the operation (think a loop counter - you don’t want to be going off to main memory to fetch this to increment it every time you loop around).

The closer the cache is to the CPU, the faster it is and the smaller it is.  L1 cache is small and very fast, and right next to the core that uses it.  L2 is bigger and slower, and still only used by a single core.  L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket.  Finally you have main memory, which is shared across all cores and all sockets.

When the CPU is performing an operation, it’s first going to look in L1 for the data it needs, then L2, then L3, and finally if it’s not in any of the caches the data needs to be fetched all the way from main memory.  The further it has to go, the longer the operation will take.  So if you’re doing something very frequently, you want to make sure that data is in L1 cache.

Martin and Mike’s QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to… Approx. number of
CPU cycles
Approx. time
in nanoseconds
Main memory ~60-80ns
QPI transit
(between sockets, not drawn)
L3 cache ~40-45 cycles, ~15ns
L2 cache ~10 cycles, ~3ns
L1 cache ~3-4 cycles, ~1ns
Register 1 cycle
If you’re aiming for an end-to-end latency of something like 10 milliseconds, an 80 nanosecond trip to main memory to get some missing data is going to take a serious chunk of that.
Cache lines
Now the interesting thing to note is that it’s not individual items that get stored in the cache - i.e. it’s not a single variable, a single pointer.  The cache is made up of cache lines, typically 64 bytes, and it effectively references a location in main memory.  A Java long is 8 bytes, so in a single cache line you could have 8 long variables.
(I’m going to ignore the multiple cache-levels for simplicity)
This is brilliant if you’re accessing an array of longs - when one value from the array gets loaded into the cache, you get up to 7 more for free.  So you can walk that array very quickly.  In fact, you can iterate over any data structure that is allocated to contiguous blocks in memory very quickly.  I made a passing reference to this in the very first post about the ring buffer, and it explains why we use an array for it.
So if items in your data structure aren’t sat next to each other in memory (linked lists, I’m looking at you) you don’t get the advantage of freebie cache loading.  You could be getting a cache miss for every item in that data structure.
However, there is a drawback to all this free loading.  Imagine your long isn’t part of an array.  Imagine it’s just a single variable.  Let’s call it head, for no real reason.  Then imagine you have another variable in your class right next to it.  Let’s arbitrarily call it tail.  Now, when you load head into your cache, you get tail for free.
Which sounds fine.  Until you realise that tail is being written to by your producer, and head is being written to by your consumer.  These two variables aren’t actually closely associated, and in fact are going to be used by two different threads that might be running on two different cores.
Imagine your consumer updates the value of head.  The cache value is updated, the value in memory is updated, and any other cache lines that contain head are invalidated because other caches will not have the shiny new value.  And remember that we deal with the level of the whole line, we can’t just mark head as being invalid.
Now if some process running on the other core just wants to read the value of tail, the whole cache line needs to be re-read from main memory.  So a thread which is nothing to do with your consumer is reading a value which is nothing to do with head, and it’s slowed down by a cache miss.
Of course this is even worse if two separate threads are writing to the two different values. Both cores are going to be invalidating the cache line on the other core and having to re-read it every time the other thread has written to it. You’ve basically got write-contention between the two threads even though they’re writing to two different variables.
This is called false sharing, because every time you access head you get tail too, and every time you access tail, you get head as well.  All this is happening under the covers, and no compiler warning is going to tell you that you just wrote code that’s going to be very inefficient for concurrent access.
Our solution - magic cache line padding
You’ll see that the Disruptor eliminates this problem, at least for architecture that has a cache size of 64 bytes or less, by adding padding to ensure the ring buffer’s sequence number is never in a cache line with anything else.

public long p1, p2, p3, p4, p5, p6, p7 // cache line padding
    private volatile long cursor = INITIAL_CURSOR_VALUE
    public long p8, p9, p10, p11, p12, p13, p14 // cache line padding
So there’s no false sharing, no unintended contention with any other variables, no needless cache misses.
It’s worth doing this on your Entry classes too - if you have different consumers writing to different fields, you’re going to need to make sure there’s no false sharing between each of the fields.

EDIT: Martin wrote a more technically correct and detailed post about false sharing, and posted performance results too.