Mike and I debut our new Disruptor presentation

Last Tuesday Mike and I unveiled our brand shiny new presentation: Understanding the Disruptor, a Beginner’s Guide to Hardcore Concurrency.  This was a preview of the talk we’ll be doing at JAX London on the 2nd November.

A video of the session is available, as are the slides.  I promise not to say “so” anywhere near as many times when I repeat my performance at JAX (is there anything more painful than watching yourself on video?).

I thought the session went really really well.  We got some great questions at the end, we had an audience that was engaged, and I was dead pleased we didn’t lose anyone with the assembly language.  We had some very valuable feedback afterwards too.

As well as our presentation, there were three great lightning talks:

    Somay Nakhal on Java Thread States - Somay gave a nice overview of thread lifecycles with code and some great diagrams.  I liked how he made this more applicable to the real world than the sort of book examples you get.

    Ged Byrne on the shiny new LJC Book Club - Ged reminded us how great it is to read an actual, paper book.  How committing to reading page by page forces you to learn in a different way to jumping around internet references that might not give you the context you need.  I thought this was a great presentation with humour, and I liked the way he challenged us to “expand our minds”.  Although the actual book he was reviewing was Oracle Coherence 3.5, I’ve decided I need to read Beautiful Software, which Ged quoted at the end of the talk.

    Peter Lawrey on Common Java Misconceptions - A session which plays well with what we’re trying to preach when we talk about Tackling Folklore.  He covered a few topics that are assumed to be “truth”.  For example, dealing with garbage collection is not a mandatory part of writing Java - you could write GC-friendly code for a start.  Also it’s naive to assume the JDK is written in an efficient way; anyone who’s actually dug around it for a while will realise that newer, more efficient methods of programming have not been applied to all areas of the (massive) existing code base.  I think it’s great to have people out there talking about this stuff - it’s too easy to make assumptions and take things for granted.  The most important thing he said: “If you’re told something, don’t just believe it - test it yourself first”.

All of us (me, Mike and the lightning talk presenters) got such a great response it has encouraged us at the LJC to try and push for more real developers presenting their experiences.  We have a lot of great presentations from vendors, but what’s more applicable to Java guys and girls across the board is other developers sharing the problems they’re trying to solve and how they go about that process.

I’m very much looking forward to presenting this again at JAX.

Disruptor 2.0 – All Change Please

Martin recently announced version 2.0 of the Disruptor - basically there have been so many changes since we first open-sourced it that it’s time to mark that officially.  His post goes over all the changes; the aim of this article is to attempt to translate my previous blog posts into new-world-speak, since it’s going to take a long time to re-write each of them all over again. Now I see the disadvantage of hand-drawing everything.

In the old world

This is an example of a configuration of the Disruptor (specifically a diamond configuration).  If none of this means anything to you, feel free to go back and refresh yourself on all the (now outdated) Disruptor details.

The most obvious changes over the last few weeks have been:

  1. Updated naming convention
  2. Integrating the producer barrier into the ring buffer
  3. Adding the Disruptor wizard into the main code base.

The New World Order
You’ll see the fundamentals are pretty much the same.  It’s simpler, because the ProducerBarrier is no longer an entity in its own right - its replacement is the PublishPort interface, which is implemented by the RingBuffer itself.
Similarly the name DependencyBarrier instead of ConsumerBarrier clarifies the job of this object; Publisher (instead of Producer) and EventProcessor instead of Consumer also more accurately represent what these things do.  There was always a little confusion over the name Consumer, since consumers never actually consumed anything from the ring buffer. It was simply a term that we hoped would make sense to those who were used to queue implementations.
Not shown on the diagram is the name change of the items in the RingBuffer - in the old world, we called this Entry, now they’re an Event, hence EventProcessor at the other end.
The aim of this wholesale rename has not been to completely discredit all my old blogs so I can continue blogging about the Disruptor ad infinitum. This is far from what I want - I have other, more fluffy, things to write about.  The aim of the rename is to make it easier to understand how the Disruptor works and how to use it. Although we use the Disruptor for event processing, when we open sourced it we wanted it to look like a general purpose solution, so the naming convention tried to represent that.  But in fact the event processing model does seem more intuitive.

No more tedious wiring
Now that the Disruptor wizard is part of the Disruptor itself, my whole post on wiring is pretty pointless - which is good, actually, because it was a little involved.

These days, if you want to create the diamond pattern (for example the FizzBuzz performance test), it’s a lot simpler:

DisruptorWizard<FizzBuzzEvent> dw = new DisruptorWizard<FizzBuzzEvent>(
        ENTRY_FACTORY,
        RING_BUFFER_SIZE,
        EXECUTOR,
        ClaimStrategy.Option.SINGLE_THREADED,
        WaitStrategy.Option.YIELDING);

FizzBuzzEventHandler fizzHandler = new FizzBuzzEventHandler(FIZZ);
FizzBuzzEventHandler buzzHandler = new FizzBuzzEventHandler(BUZZ);
FizzBuzzEventHandler fizzBuzzHandler = new FizzBuzzEventHandler(FIZZ_BUZZ);

dw.handleEventsWith(fizzHandler, buzzHandler).then(fizzBuzzHandler);

RingBuffer<FizzBuzzEvent> ringBuffer = dw.start();
Note there is a Wiki page on the Disruptor Wizard.

Other changes: performance improvements
As Martin mentions in his post, he’s managed to significantly improve the performance (even more!) of the Disruptor in 2.0.

The short version of this is that there is a shiny new class, Sequence, which both takes care of the cache line padding, and removes the need for memory barriers.  The cache line padding is now done slightly differently because, bless Java 7’s little cotton socks, it managed to “optimise” our old technique away.

I’ll leave you to read the details over there; in this post I just wanted to give a quick summary of the changes and explain why my old diagrams may no longer be correct.

Dissecting the Disruptor: Demystifying Memory Barriers

My recent slow-down in posting is because I’ve been trying to write a post explaining memory barriers and their applicability in the Disruptor. The problem is, no matter how much I read and no matter how many times I ask the ever-patient Martin and Mike questions trying to clarify some point, I just don’t intuitively grasp the subject. I guess I don’t have the deep background knowledge required to fully understand.

So, rather than make an idiot of myself trying to explain something I don’t really get, I’m going to try and cover, at an abstract / massive-simplification level, what I do understand in the area.  Martin has written a post going into memory barriers in some detail, so hopefully I can get away with skimming the subject.

Disclaimer: any errors in the explanation are completely my own, and no reflection on the implementation of the Disruptor or on the LMAX guys who actually do know about this stuff.

What’s the point?
My main aim in this series of blog posts is to explain how the Disruptor works and, to a slightly lesser extent, why. In theory I should be able to provide a bridge between the code and the technical paper by talking about it from the point of view of a developer who might want to use it.

The paper mentioned memory barriers, and I wanted to understand what they were, and how they apply.

What’s a Memory Barrier?
It’s a CPU instruction.  Yes, once again, we’re thinking about CPU-level stuff in order to get the performance we need (Martin’s famous Mechanical Sympathy).  Basically it’s an instruction to a) ensure the order in which certain operations are executed and b) influence visibility of some data (which might be the result of executing some instruction).

Compilers and CPUs can re-order instructions, provided the end result is the same, to try and optimise performance.  Inserting a memory barrier tells the CPU and the compiler that what happened before that command needs to stay before that command, and what happens after needs to stay after.  All similarities to a trip to Vegas are entirely in your own mind.

The other thing a memory barrier does is force an update of the various CPU caches - for example, a write barrier will flush all the data that was written before the barrier out to cache, therefore any other thread that tries to read that data will get the most up-to-date version regardless of which core or which socket it might be executing on.

What’s this got to do with Java?
Now I know what you’re thinking - this isn’t assembler.  It’s Java.

The magic incantation here is the word volatile (something I felt was never clearly explained in the Java certification).  If your field is volatile, the Java Memory Model inserts a write barrier instruction after you write to it, and a read barrier instruction before you read from it.


This means if you write to a volatile field, you know that:

  1. Any thread accessing that field after the point at which you wrote to it will get the updated value 
  2. Anything you did before you wrote that field is guaranteed to have happened and any updated data values will also be visible, because the memory barrier flushed all earlier writes to the cache.
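To make those two guarantees a bit more concrete, here’s a tiny sketch - nothing to do with the Disruptor code itself, and the names are made up - of the classic publish-then-flag pattern:

    public class VolatileExample
    {
        private long value;                      // plain field, written before the volatile write
        private volatile long sequence = -1;     // volatile field: write barrier after writes, read barrier before reads

        public void publish(long newValue)
        {
            value = newValue;    // this write happens before...
            sequence = 0;        // ...this volatile write, which makes both visible to other threads
        }

        public long read()
        {
            if (sequence != -1)  // volatile read: if we see the new sequence, we also see value
            {
                return value;
            }
            return -1;
        }
    }

If the reading thread sees the new sequence, it is guaranteed to also see the value written just before it - that’s the write and read barriers doing their job.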
Example please!
So glad you asked.  It’s about time I started drawing doughnuts again.
The RingBuffer cursor is one of these magic volatile thingies, and it’s one of the reasons we can get away with implementing the Disruptor without locking.
The Producer will obtain the next Entry (or batch of them) and do whatever it needs to do to the entries, updating them with whatever values it wants to place in there.  As you know, at the end of all the changes the producer calls the commit method on the ring buffer, which updates the sequence number.  This write of the volatile field (cursor) creates a memory barrier which ultimately brings all the caches up to date (or at least invalidates them accordingly).  
At this point, the consumers can get the updated sequence number (8), and because the memory barrier also guarantees the ordering of the instructions that happened before then, the consumers can be confident that all changes the producer did to the Entry at position 7 are also available.
…and on the Consumer side?
The sequence number on the Consumer is volatile, and read by a number of external objects - other downstream consumers might be tracking this consumer, and the ProducerBarrier/RingBuffer (depending on whether you’re looking at older or newer code) tracks it to make sure the ring doesn’t wrap.

So, if your downstream consumer (C2) sees that an earlier consumer (C1) reaches number 12, when C2 reads entries up to 12 from the ring buffer it will get all updates C1 made to the entries before it updated its sequence number.

Basically everything that happens after C2 gets the updated sequence number (shown in blue above) must occur after everything C1 did to the ring buffer before updating its sequence number (shown in black).

Impact on performance
Memory barriers, being CPU-level instructions, don’t have the same cost as locks - the kernel isn’t interfering and arbitrating between multiple threads.  But nothing comes for free.  Memory barriers do have a cost - the compiler/CPU cannot re-order instructions around them, which could potentially lead to not using the CPU as efficiently as possible, and refreshing the caches obviously has a performance impact.  So don’t think that using volatile instead of locking will get you away scot free.

You’ll notice that the Disruptor implementation tries to read from and write to the sequence number as infrequently as possible.  Every read or write of a volatile field is a relatively costly operation. However, recognising this also plays in quite nicely with batching behaviour - if you know you shouldn’t read from or write to the sequences too frequently, it makes sense to grab a whole batch of Entries and process them before updating the sequence number, both on the Producer and Consumer side. Here’s an example from BatchConsumer:

    long nextSequence = sequence + 1;
    while (running)
    {
        try
        {
            final long availableSequence = consumerBarrier.waitFor(nextSequence);
            while (nextSequence <= availableSequence)
            {
                entry = consumerBarrier.getEntry(nextSequence);
                handler.onAvailable(entry);
                nextSequence++;
            }
            handler.onEndOfBatch();
            sequence = entry.getSequence();
        }
        catch (final Exception ex)
        {
            exceptionHandler.handle(ex, entry);
            sequence = entry.getSequence();
            nextSequence = entry.getSequence() + 1;
        }
    }

(You’ll note this is the “old” code and naming conventions; since it’s in line with my previous blog posts, I thought it was slightly less confusing than switching straight to the new conventions.)

In the code above, we use a local variable to increment during our loop over the entries the consumer is processing.  This means we read from and write to the volatile sequence field (shown in bold) as infrequently as we can get away with.

In Summary
Memory barriers are CPU instructions that allow you to make certain assumptions about when data will be visible to other processes.  In Java, you implement them with the volatile keyword.  Using volatile means you don’t necessarily have to add locks willy nilly, and will give you performance improvements over using them.  However you need to think a little more carefully about your design, in particular how frequently you use volatile fields, and how frequently you read and write them.

PS Given that the New World Order in the Disruptor uses totally different naming conventions now to everything I’ve blogged about so far, I guess the next post is mapping the old world to the new one.

Dissecting the Disruptor: Why it's so fast (part two) – Magic cache line padding

We mention the phrase Mechanical Sympathy quite a lot, in fact it’s even Martin’s blog title.  It’s about understanding how the underlying hardware operates and programming in a way that works with that, not against it.

We get a number of comments and questions about the mysterious cache line padding in the RingBuffer, and I referred to it in the last post.  Since this lends itself to pretty pictures, it’s the next thing I thought I would tackle.

Comp Sci 101
One of the things I love about working at LMAX is all that stuff I learnt at university and in my A Level Computing actually means something.  So often as a developer you can get away with not understanding the CPU, data structures or Big O notation - I spent 10 years of my career forgetting all that.  But it turns out that if you do know about these things, and you apply that knowledge, you can come up with some very clever, very fast code.

So, a refresher for those of us who studied this at school, and an intro for those who didn’t.  Beware - this post contains massive over-simplifications.

The CPU is the heart of your machine and the thing that ultimately has to do all the operations, executing your program.  Main memory (RAM) is where your data (including the lines of your program) lives.  We’re going to ignore stuff like hard drives and networks here because the Disruptor is aimed at running as much as possible in memory.

The CPU has several layers of cache between it and main memory, because even accessing main memory is too slow.  If you’re doing the same operation on a piece of data multiple times, it makes sense to load this into a place very close to the CPU when it’s performing the operation (think a loop counter - you don’t want to be going off to main memory to fetch this to increment it every time you loop around).

The closer the cache is to the CPU, the faster it is and the smaller it is.  L1 cache is small and very fast, and right next to the core that uses it.  L2 is bigger and slower, and still only used by a single core.  L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket.  Finally you have main memory, which is shared across all cores and all sockets.

When the CPU is performing an operation, it’s first going to look in L1 for the data it needs, then L2, then L3, and finally if it’s not in any of the caches the data needs to be fetched all the way from main memory.  The further it has to go, the longer the operation will take.  So if you’re doing something very frequently, you want to make sure that data is in L1 cache.

Martin and Mike’s QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to…                        Approx. number of CPU cycles    Approx. time in nanoseconds
Main memory                                                                 ~60-80ns
QPI transit (between sockets, not drawn)                                    ~20ns
L3 cache                                    ~40-45 cycles                   ~15ns
L2 cache                                    ~10 cycles                      ~3ns
L1 cache                                    ~3-4 cycles                     ~1ns
Register                                    1 cycle

If you’re aiming for an end-to-end latency of something like 10 milliseconds, an 80 nanosecond trip to main memory to get some missing data is going to take a serious chunk of that.
Cache lines
Now the interesting thing to note is that it’s not individual items that get stored in the cache - i.e. it’s not a single variable or a single pointer.  The cache is made up of cache lines, typically 64 bytes, and each line effectively references a block of main memory.  A Java long is 8 bytes, so in a single cache line you could have 8 long variables.
(I’m going to ignore the multiple cache-levels for simplicity)
This is brilliant if you’re accessing an array of longs - when one value from the array gets loaded into the cache, you get up to 7 more for free.  So you can walk that array very quickly.  In fact, you can iterate over any data structure that is allocated to contiguous blocks in memory very quickly.  I made a passing reference to this in the very first post about the ring buffer, and it explains why we use an array for it.
So if items in your data structure aren’t sat next to each other in memory (linked lists, I’m looking at you) you don’t get the advantage of freebie cache loading.  You could be getting a cache miss for every item in that data structure.
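If you want to see this for yourself, here’s a throwaway sketch - nothing to do with the Disruptor, and the timings will vary from machine to machine - comparing a walk over a contiguous array of longs with a walk over a LinkedList.  Wrap each loop in your favourite timer and the array should win comfortably:

import java.util.LinkedList;
import java.util.List;

public class CacheFriendlyTraversal
{
    public static void main(String[] args)
    {
        final int size = 1000000;

        // contiguous memory: neighbouring longs share cache lines, so once one value
        // is loaded the next few come along for free
        long[] array = new long[size];
        long arraySum = 0;
        for (int i = 0; i < size; i++)
        {
            arraySum += array[i];
        }

        // scattered memory: each node (and each boxed Long) can live anywhere on the
        // heap, so every hop is a potential cache miss
        List<Long> list = new LinkedList<Long>();
        for (int i = 0; i < size; i++)
        {
            list.add(0L);
        }
        long listSum = 0;
        for (long value : list)
        {
            listSum += value;
        }

        System.out.println(arraySum + " " + listSum);
    }
}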
However, there is a drawback to all this free loading.  Imagine your long isn’t part of an array.  Imagine it’s just a single variable.  Let’s call it head, for no real reason.  Then imagine you have another variable in your class right next to it.  Let’s arbitrarily call it tail.  Now, when you load head into your cache, you get tail for free.
Which sounds fine.  Until you realise that tail is being written to by your producer, and head is being written to by your consumer.  These two variables aren’t actually closely associated, and in fact are going to be used by two different threads that might be running on two different cores.
Imagine your consumer updates the value of head.  The cache value is updated, the value in memory is updated, and any other cache lines that contain head are invalidated because other caches will not have the shiny new value.  And remember that we deal with the level of the whole line, we can’t just mark head as being invalid.
Now if some process running on the other core just wants to read the value of tail, the whole cache line needs to be re-read from main memory.  So a thread which is nothing to do with your consumer is reading a value which is nothing to do with head, and it’s slowed down by a cache miss.
Of course this is even worse if two separate threads are writing to the two different values. Both cores are going to be invalidating the cache line on the other core and having to re-read it every time the other thread has written to it. You’ve basically got write-contention between the two threads even though they’re writing to two different variables.
This is called false sharing, because every time you access head you get tail too, and every time you access tail, you get head as well.  All this is happening under the covers, and no compiler warning is going to tell you that you just wrote code that’s going to be very inefficient for concurrent access.
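If you fancy watching false sharing happen, here’s a crude sketch - definitely not Disruptor code, and the numbers will be different on every machine - with two threads hammering away at head and tail, which will almost certainly share a cache line.  Pad the two fields apart (or move them into separate objects) and the time should drop noticeably:

public class FalseSharingExample
{
    // head and tail will very likely end up on the same 64-byte cache line,
    // even though they are only ever written by different threads
    volatile long head = 0;
    volatile long tail = 0;

    public static void main(String[] args) throws InterruptedException
    {
        final FalseSharingExample shared = new FalseSharingExample();

        Thread consumerThread = new Thread(new Runnable()
        {
            public void run()
            {
                for (long i = 0; i < 100000000L; i++)
                {
                    shared.head = i;   // every write invalidates the line holding tail as well
                }
            }
        });

        Thread producerThread = new Thread(new Runnable()
        {
            public void run()
            {
                for (long i = 0; i < 100000000L; i++)
                {
                    shared.tail = i;   // ...and vice versa, so the two cores ping-pong the cache line
                }
            }
        });

        long start = System.nanoTime();
        consumerThread.start();
        producerThread.start();
        consumerThread.join();
        producerThread.join();
        System.out.println("took " + (System.nanoTime() - start) / 1000000 + "ms");
    }
}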
Our solution - magic cache line padding
You’ll see that the Disruptor eliminates this problem, at least for architectures with a cache line size of 64 bytes or less, by adding padding to ensure the ring buffer’s sequence number is never in a cache line with anything else.

public long p1, p2, p3, p4, p5, p6, p7; // cache line padding
private volatile long cursor = INITIAL_CURSOR_VALUE;
public long p8, p9, p10, p11, p12, p13, p14; // cache line padding
So there’s no false sharing, no unintended contention with any other variables, no needless cache misses.
It’s worth doing this on your Entry classes too - if you have different consumers writing to different fields, you’re going to need to make sure there’s no false sharing between each of the fields.

EDIT: Martin wrote a more technically correct and detailed post about false sharing, and posted performance results too.

Dissecting the Disruptor: Why it's so fast (part one) – Locks Are Bad

Martin Fowler has written a really good article describing not only the Disruptor, but also how it fits into the architecture at LMAX.  This gives some of the context that has been missing so far, but the most frequently asked question is still “What is the Disruptor?".

I’m working up to answering that.  I’m currently on question number two: “Why is it so fast?".

These questions do go hand in hand, however, because I can’t talk about why it’s fast without saying what it does, and I can’t talk about what it is without saying why it is that way.

So I’m trapped in a circular dependency.  A circular dependency of blogging.

To break the dependency, I’m going to answer question one with the simplest answer, and with any luck I’ll come back to it in a later post if it still needs explanation: the Disruptor is a way to pass information between threads.

As a developer, already my alarm bells are going off because the word “thread” was just mentioned, which means this is about concurrency, and Concurrency Is Hard.

Concurrency 101

Imagine two threads are trying to change the same value.

Case One: Thread 1 gets there first:

  1. The value changes to “blah”
  2. Then the value changes to “blahy” when Thread 2 gets there.

Case Two: Thread 2 gets there first:

  1. The value changes to “fluffy”
  2. Then the value changes to “blah” when Thread 1 gets there.

Case Three: Thread 1 interrupts Thread 2:

  1. Thread 2 gets the value “fluff” and stores it as myValue
  2. Thread 1 goes in and updates value to “blah”
  3. Then Thread 2 wakes up and sets the value to “fluffy”.
Case Three is probably the only one which is definitely wrong, unless you think the naive approach to wiki editing is OK (Google Code Wiki, I’m looking at you…).  In the other two cases it’s all about intentions and predictability.  Thread 2 might not care what’s in value, the intention might be to append “y” to whatever is in there regardless.  In this circumstance, cases one and two are both correct.

But if Thread 2 only wanted to change “fluff” to “fluffy”, then both cases two and three are incorrect.
Assuming that Thread 2 wants to set the value to “fluffy”, there are some different approaches to solving the problem.
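Before we look at those approaches, here’s roughly what Case Three looks like in code - a contrived sketch, with a sleep in there purely to force the unlucky interleaving:

public class LostUpdateExample
{
    static volatile String value = "fluff";

    public static void main(String[] args) throws InterruptedException
    {
        // Thread 2: read the value, dawdle for a bit, then write back the appended result
        Thread thread2 = new Thread(new Runnable()
        {
            public void run()
            {
                String myValue = value;    // reads "fluff"...
                pause(100);                // ...then gets interrupted
                value = myValue + "y";     // writes "fluffy", trampling whatever happened in between
            }
        });

        // Thread 1: nips in while Thread 2 is dawdling
        Thread thread1 = new Thread(new Runnable()
        {
            public void run()
            {
                pause(10);
                value = "blah";
            }
        });

        thread2.start();
        thread1.start();
        thread2.join();
        thread1.join();

        System.out.println(value);         // almost certainly prints "fluffy" - Thread 1's write is lost
    }

    private static void pause(long millis)
    {
        try
        {
            Thread.sleep(millis);
        }
        catch (InterruptedException e)
        {
            // not interesting for this sketch
        }
    }
}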

Approach One: Pessimistic locking

(Does the “No Entry” sign make sense to people who don’t drive in Britain?)

The terms pessimistic and optimistic locking seem to be more commonly used when talking about database reads and writes, but the principle applies to getting a lock on an object.

Thread 2 grabs a lock on Entry as soon as it knows it needs it and stops anything from setting it. Then it does its thing, sets the value, and lets everything else carry on.

You can imagine this gets quite expensive, with threads hanging around all over the place trying to get hold of objects and being blocked.  The more threads you have, the more chance that things are going to grind to a halt.

Approach Two: Optimistic locking

In this case Thread 2 will only lock Entry when it needs to write to it.  In order to make this work, it needs to check if Entry has changed since it first looked at it.  If Thread 1 came in and changed the value to “blah” after Thread 2 had read the value, Thread 2 couldn’t write “fluffy” to the Entry and trample all over the change from Thread 1.  Thread 2 could either re-try (go back, read the value, and append “y” onto the end of the new value), which you would do if Thread 2 didn’t care what the value it was changing was; or it could throw an exception or return some sort of failed update flag if it was expecting to change “fluff” to “fluffy”.  An example of this latter case might be if you have two users trying to update a Wiki page, and you tell the user on the other end of Thread 2 they’ll need to load the new changes from Thread 1 and then reapply their changes.
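In Java you might sketch the optimistic approach with an AtomicReference and compareAndSet.  This isn’t Disruptor code, just an illustration of the two flavours described above:

import java.util.concurrent.atomic.AtomicReference;

public class OptimisticUpdateExample
{
    static final AtomicReference<String> value = new AtomicReference<String>("fluff");

    // Thread 2 doesn't care what's in there: append "y", and if someone else changed
    // the value in the meantime, re-read it and try again
    static void appendY()
    {
        while (true)
        {
            String current = value.get();
            if (value.compareAndSet(current, current + "y"))
            {
                return;    // our update stuck - nothing changed underneath us
            }
            // someone got in first - loop round and retry against the new value
        }
    }

    // Thread 2 only wants to turn "fluff" into "fluffy": fail rather than trample another change
    // (compareAndSet compares references, which works here because both are the same interned literal)
    static boolean changeFluffToFluffy()
    {
        return value.compareAndSet("fluff", "fluffy");
    }
}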

Potential Problem: Deadlock
Locking can lead to all sorts of issues, for example deadlock.  Imagine two threads that need access to two resources to do whatever they need to do:

If you’ve used an over-zealous locking technique, both threads are going to sit there forever waiting for the other one to release its lock on the resource.  That’s when you reboot Windows... I mean, your computer.

Definite Problem: Locks are sloooow…
The thing about locks is that they need the operating system to arbitrate the argument.  The threads are like siblings squabbling over a toy, and the OS kernel is the parent that decides which one gets it. It’s like when you run to your Dad to tell him your sister has nicked the Transformer when you wanted to play with it - he’s got bigger things to worry about than you two fighting, and he might finish off loading the dishwasher and putting on the laundry before settling the argument.  If you draw attention to yourself with a lock, not only does it take time to get the operating system to arbitrate, the OS might decide the CPU has better things to do than servicing your thread.

The Disruptor paper talks about an experiment we did.  The test calls a function incrementing a 64-bit counter in a loop 500 million times.  For a single thread with no locking, the test takes 300ms.  If you add a lock (and this is for a single thread, no contention, and no additional complexity other than the lock) the test takes 10,000ms.  That’s, like, two orders of magnitude slower.  Even more astounding, if you add a second thread (which logic suggests should take maybe half the time of the single thread with a lock) it takes 224,000ms.  Incrementing a counter 500 million times takes nearly a thousand times longer when you split it over two threads instead of running it on one with no lock.

Concurrency Is Hard and Locks Are Bad
I’m just touching the surface of the problem, and obviously I’m using very simple examples.  But the point is, if your code is meant to work in a multi-threaded environment, your job as a developer just got a lot more difficult:

  • Naive code can have unintended consequences.  Case Three above is an example of how things can go horribly wrong if you don’t realise you have multiple threads accessing and writing to the same data.
  • Selfish code is going to slow your system down.  Using locks to protect your code from the problem in Case Three can lead to things like deadlock or simply poor performance.

This is why many organisations have some sort of concurrency problems in their interview process (certainly for Java interviews).  Unfortunately it’s very easy to learn how to answer the questions without really understanding the problem, or possible solutions to it.

How does the Disruptor address these issues?
For a start, it doesn’t use locks.  At all.

Instead, where we need to make sure that operations are thread-safe (specifically, updating the next available sequence number in the case of multiple producers), we use a CAS (Compare And Swap/Set) operation.  This is a CPU-level instruction, and in my mind it works a bit like optimistic locking - the CPU goes to update a value, but if the value it’s changing it from is not the one it expects, the operation fails because clearly something else got in there first.

Note this could be two different cores rather than two separate CPUs.

CAS operations are much cheaper than locks because they don’t involve the operating system, they go straight to the CPU.  But they’re not cost-free - in the experiment I mentioned above, where a lock-free thread takes 300ms and a thread with a lock takes 10,000ms, a single thread using CAS takes 5,700ms.  So it takes less time than using a lock, but more time than a single thread that doesn’t worry about contention at all.
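If you want a feel for those numbers, here’s a very naive sketch of that counter experiment.  It’s not the LMAX test harness, and a micro-benchmark like this is at the mercy of the JIT and your hardware, so treat the output as nothing more than illustrative:

import java.util.concurrent.atomic.AtomicLong;

public class CounterCostExample
{
    private static final long ITERATIONS = 500L * 1000L * 1000L;

    public static void main(String[] args)
    {
        final Object lock = new Object();

        long plainCounter = 0;
        long start = System.currentTimeMillis();
        for (long i = 0; i < ITERATIONS; i++)
        {
            plainCounter++;                  // single thread, no protection at all
        }
        System.out.println("no lock: " + (System.currentTimeMillis() - start) + "ms");

        long lockedCounter = 0;
        start = System.currentTimeMillis();
        for (long i = 0; i < ITERATIONS; i++)
        {
            synchronized (lock)              // still single-threaded, but we pay for the lock anyway
            {
                lockedCounter++;
            }
        }
        System.out.println("with lock: " + (System.currentTimeMillis() - start) + "ms");

        AtomicLong casCounter = new AtomicLong(0);
        start = System.currentTimeMillis();
        for (long i = 0; i < ITERATIONS; i++)
        {
            casCounter.incrementAndGet();    // CAS: no trip to the kernel, but not free either
        }
        System.out.println("with CAS: " + (System.currentTimeMillis() - start) + "ms");

        // use the counters so the JIT can't quietly throw the loops away
        System.out.println(plainCounter + " " + lockedCounter + " " + casCounter.get());
    }
}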

Back to the Disruptor - I talked about the ClaimStrategy when I went over the producers.  In the code you’ll see two strategies, a SingleThreadedStrategy and a MultiThreadedStrategy.  You could argue, why not just use the multi-threaded one with only a single producer?  Surely it can handle that case?  And it can.  But the multi-threaded one uses an AtomicLong (Java’s way of providing CAS operations), and the single-threaded one uses a simple long with no locks and no CAS.  This means the single-threaded claim strategy is as fast as possible, given that it knows there is only one producer and therefore no contention on the sequence number.

I know what you’re thinking: turning one single number into an AtomicLong can’t possibly have been the only thing that is the secret to the Disruptor’s speed. And of course, it’s not - otherwise this wouldn’t be called “Why it’s so fast (part one)".

But this is an important point - there’s only one place in the code where multiple threads might be trying to update the same value.  Only one place in the whole of this complicated data-structure-slash-framework.  And that’s the secret.  Remember everything has its own sequence number?  If you only have one producer then every sequence number in the system is only ever written to by one thread. That means there is no contention.  No need for locks.  No need even for CAS.  The only sequence number that is ever written to by more than one thread is the one on the ClaimStrategy if there is more than one producer.

This is also why each variable in the Entry can only be written to by one consumer.  It ensures there’s no write contention, therefore no need for locks or CAS.

Back to why queues aren’t up to the job
So you start to see why queues, which may be implemented as a ring buffer under the covers, still can’t match the performance of the Disruptor.  The queue, and the basic ring buffer, only has two pointers - one to the front of the queue and one to the end:

If more than one producer wants to place something on the queue, the tail pointer will be a point of contention as more than one thing wants to write to it.  If there’s more than one consumer, then the head pointer is contended, because this is not just a read operation but a write, as the pointer is updated when the item is consumed from the queue.

But wait, I hear you cry foul!  Because we already knew this, so queues are usually single producer and single consumer (or at least they are in all the queue comparisons in our performance tests).

There’s another thing to bear in mind with queues/buffers.  The whole point is to provide a place for things to hang out between producers and consumers, to help buffer bursts of messages from one to the other.  This means the buffer is usually full (the producer is out-pacing the consumer) or empty (the consumer is out-pacing the producer).  It’s rare that the producer and consumer will be so evenly-matched that the buffer has items in it but the producers and consumers are keeping pace with each other.

So this is how things really look.  An empty queue:

…and a full queue:

The queue needs a size so that it can tell the difference between empty and full.  Or, if it doesn’t, it might determine that based on the contents of that entry, in which case reading an entry will require a write to erase it or mark it as consumed.

Whichever implementation is chosen, there’s quite a bit of contention around the tail, head and size variables, or the entry itself if a consume operation also includes a write to remove it.

On top of this, these three variables are often in the same cache line, leading to false sharing.  So, not only do you have to worry about the producer and the consumer both causing a write to the size variable (or the entry), updating the tail pointer could lead to a cache-miss when the head pointer is updated because they’re sat in the same place.  I’m going to duck out of going into that in detail because this post is quite long enough as it is.

So this is what we mean when we talk about “Teasing Apart the Concerns” or a queue’s “conflated concerns”.  By giving everything its own sequence number and by allowing only one consumer to write to each variable in the Entry, the only case the Disruptor needs to manage contention is where more than one producer is writing to the ring buffer.

In summary
The Disruptor has a number of advantages over traditional approaches:

  1. No contention = no locks = it’s very fast.
  2. Having everything track its own sequence number allows multiple producers and multiple consumers to use the same data structure.
  3. Tracking sequence numbers at each individual place (ring buffer, claim strategy, producers and consumers), plus the magic cache line padding, means no false sharing and no unexpected contention.
EDIT: Note that version 2.0 of the Disruptor uses different names to the ones in this article.  Please see my summary of the changes if you are confused about class names.

Dissecting the Disruptor: Wiring up the dependencies

So now I’ve covered the ring buffer itself, reading from it and writing to it.

Logically the next thing to do is to wire everything up together.

I talked about multiple producers - they have the producer barrier to keep them in order and under control.  I’ve talked about consumers in a simple situation.  Multiple consumers can get a little more involved.  We’ve done some clever stuff to allow the consumers to be dependent on each other and the ring buffer.  Like a lot of applications, we have a pipeline of things that need to happen before we can actually get on with the business logic - for example, we need to make sure the messages have been journalled to disk before we can do anything.

The Disruptor paper and the performance tests cover some basic configurations that you might want. I’m going to go over the most interesting one, mostly because I needed the practice with the graphics tablet.

Diamond configuration
DiamondPath1P3CPerfTest illustrates a configuration which is not too uncommon - a single producer with three consumers.  The tricky point being that the third consumer is dependent upon the previous two consumers to finish before it can do anything.

Consumer three might be your business logic, consumer one could be backing up the data received, and consumer two may be preparing the data or something.

Diamond configuration using queues
In a SEDA-style architecture, each stage will be separated by a queue:

(Why does queue have to have so many “e"s?  It’s the letter I have the most trouble with in these drawings).

You might get an inkling of the problem here: for a message to get from P1 to C3 it has to travel through four whole queues, each queue taking its cost in terms of putting the message on the queue and taking it off again.

Diamond configuration using the Disruptor
In the Disruptor world, it’s all managed on a single ring buffer:

It does look more complicated.  But the ring buffer remains the single point of contact between all the players, and the interactions are all based on the barriers checking the sequence numbers of the things it’s dependent upon.

The producer side is fairly simple, it’s the single producer model described in my last post. Interestingly, the producer barrier doesn’t have to care about all the consumers.  It only cares about consumer three, because if consumer three has finished with an item in the ring buffer the other two will already have processed it.  So if C3 has moved on, that slot in the ring buffer is available.

To manage the dependencies between the consumers you need two consumer barriers.  The first just talks to the ring buffer, and consumers one and two ask it for the next available item.  The second consumer barrier knows about consumers one and two, and it will return the lowest sequence number processed by both consumers.

How consumer dependencies work in the Disruptor
Hmm.  I can see I’m going to need an example.

We’re joining the party halfway through the story: the producer has filled the ring buffer up to sequence number 22; consumer one has read and processed everything up to 21; consumer two has processed everything up to sequence 18; consumer three, which is dependent upon the other consumers, has only made it as far as 15.

The producer can’t write anything more to the ring buffer because sequence 15 is taking up the slot where we’d want to put sequence 23.

(I’m sorry, I really did try to find an alternative to red and green, but everything else was just as ambiguous).

The first consumer barrier lets consumers one and two know they can grab anything up to sequence 22, the highest sequence number in the ring buffer.  The second consumer barrier checks the ring buffer sequence, but it also checks the sequences on the other two consumers and returns the lowest value.  So consumer three is told it can get anything up to sequence 18 from the ring buffer.

Note that the consumers are still reading the entries directly from the ring buffer - consumers one and two are not taking the entries off the ring buffer and then passing them on to consumer three.  Instead, the second consumer barrier is letting consumer three know which entry in the ring buffer it’s safe to process.

This raises a question - if everything comes directly off the ring buffer, how is consumer three going to find out about anything the first two consumers have done?  If all consumer three cares about is that the earlier consumers have done their job (e.g. replicating the data to somewhere else) then everything’s fine - when consumer three is told the job is done, it’s happy.  If, however, consumer three needs the results of an earlier consumer’s processing, where does it get that from?

Modifying entries
The secret is to write them to the ring buffer Entry itself.  This way, when consumer three grabs the entry off the ring buffer, it will have been populated with all the information consumer three needs to do the job.  The really important part of this is that for each field on the Entry only one consumer is allowed to write to it.  This prevents any write-contention which will slow the whole thing down.

You can see this in DiamondPath1P3CPerfTest - FizzBuzzEntry has two fields as well as the value: fizz and buzz.  If the consumer is a Fizz consumer, it writes to fizz.  If it’s a Buzz consumer, it writes to buzz.  The third consumer, FizzBuzz, will read both of these fields but not write to either, since reading is fine and won’t cause contention.
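For reference, FizzBuzzEntry looks something like this - a simplified sketch from memory, the real class lives in the Disruptor performance tests:

public final class FizzBuzzEntry extends AbstractEntry
{
    private long value;      // written by the producer
    private boolean fizz;    // written only by the Fizz consumer
    private boolean buzz;    // written only by the Buzz consumer

    public long getValue()             { return value; }
    public void setValue(long value)   { this.value = value; }

    public boolean isFizz()            { return fizz; }
    public void setFizz(boolean fizz)  { this.fizz = fizz; }

    public boolean isBuzz()            { return buzz; }
    public void setBuzz(boolean buzz)  { this.buzz = buzz; }
}

The FizzBuzz consumer reads fizz and buzz but never writes to them, so no field is ever written by more than one thread.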

Some actual Java code
All this looks more complicated than the queue implementation.  And yes, it does involve a bit more coordination.  But this is hidden from the consumers and producers, they just talk to the barriers.  The trick is in the configuration.  The diamond graph in the example above would be created using something like the following:

ConsumerBarrier consumerBarrier1 = ringBuffer.createConsumerBarrier();

BatchConsumer consumer1 = new BatchConsumer(consumerBarrier1, handler1);
BatchConsumer consumer2 = new BatchConsumer(consumerBarrier1, handler2);

ConsumerBarrier consumerBarrier2 =
ringBuffer.createConsumerBarrier(consumer1, consumer2);

BatchConsumer consumer3 = new BatchConsumer(consumerBarrier2, handler3);

ProducerBarrier producerBarrier =
ringBuffer.createProducerBarrier(consumer3);
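
One thing that snippet doesn’t show is that the consumers still need threads to run on.  If I remember the old API correctly, each BatchConsumer is a Runnable, so you hand them to whatever executor you’re using (EXECUTOR below is just a stand-in for your thread pool):

EXECUTOR.execute(consumer1);
EXECUTOR.execute(consumer2);
EXECUTOR.execute(consumer3);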

In summary
So there you have it - how to wire up the Disruptor with multiple consumers that are dependent on each other.  The key points:

  • Use multiple consumer barriers to manage dependencies between consumers.
  • Have the producer barrier watch the last consumer in the graph.
  • Allow only one consumer to write to an individual field in an Entry.

EDIT: Adrian has written a nice DSL to make wiring up the Disruptor much easier.

EDIT 2: Note that version 2.0 of the Disruptor uses different names to the ones in this article.  Please see my summary of the changes if you are confused about class names.  Also Adrian’s DSL is now part of the main Disruptor code base.

    Dissecting the Disruptor: Writing to the ring buffer

    This is the missing piece in the end-to-end view of the Disruptor.  Brace yourselves, it's quite long.  But I decided to keep it in a single blog so you could have the context in one place.

    The important areas are: not wrapping the ring; informing the consumers; batching for producers; and how multiple producers work.

    ProducerBarriers
    The Disruptor code has interfaces and helper classes for the Consumers, but there's no interface for your producer, the thing that writes to the ring buffer.  That's because nothing else needs to access your producer, only you need to know about it.  However, like the consuming side, a ProducerBarrier is created by the ring buffer and your producer will use this to write to it.

    Writing to the ring buffer involves a two-phase commit.  First, your producer has to claim the next slot on the buffer.  Then, when the producer has finished writing to the slot, it will call commit on the ProducerBarrier.

    So let's look at the first bit.  It sounds easy - "get me the next slot on the ring buffer".  Well, from your producer's point of view it is easy.  You simply call nextEntry() on the ProducerBarrier.  This will return you an Entry object which is basically the next slot in the ring buffer.
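
    Putting the two phases together, the shape of the producing side is something like this (a sketch rather than real code - in practice you'd be working with your own Entry subclass):

        Entry entry = producerBarrier.nextEntry();   // phase one: claim the next slot
        // ... populate the entry with whatever you want to publish ...
        producerBarrier.commit(entry);               // phase two: make it visible to the consumers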

    The ProducerBarrier makes sure the ring buffer doesn't wrap
    Under the covers, the ProducerBarrier is doing all the negotiation to figure out what the next slot is, and if you're allowed to write to it yet.

    (I'm not convinced the shiny new graphics tablet is helping the clarity of my pictures, but it's fun to use).

    For this illustration, we're going to assume there's only one producer writing to the ring buffer.  We will deal with the intricacies of multiple producers later.

    The ConsumerTrackingProducerBarrier has a list of all the Consumers that are accessing the ring buffer.  Now to me this seemed a bit odd - I wouldn't expect the ProducerBarrier to know anything about the consuming side. But wait, there is a reason.  Because we don't want the "conflation of concerns" a queue has (it has to track the head and tail which are sometimes the same point), our consumers are responsible for knowing which sequence number they're up to, not the ring buffer.  So, if we want to make sure we don't wrap the buffer, we need to check where the consumers have got to.

    In the diagram above, one Consumer is happily at the same point as the highest sequence number (12, highlighted in red/pink). The second Consumer is a bit behind - maybe it's doing I/O operations or something - and it's at sequence number 3.  Therefore consumer 2 has the whole length of the buffer to go before it catches up with consumer 1.

    The producer wants to write to the slot on the ring buffer currently occupied by sequence 3, because this slot is the one after the current ring buffer cursor.  But the ProducerBarrier knows it can't write here because a Consumer is using it.  So the ProducerBarrier sits and spins, waiting, until the consumers move on.

    Claiming the next slot
    Now imagine consumer 2 has finished that batch of entries, and moves its sequence number on. Maybe it got as far as sequence 9 (in real life I expect it will make it as far as 12 because of the way consumer batching works, but that doesn't make the example as interesting).

    The diagram above shows what happens when consumer 2 updates to sequence number 9.  I've slimmed down the ConsumerBarrier in this picture because it takes no active part in this scene.

    The ProducerBarrier sees that the next slot, the one that had sequence number 3, is now available.  It grabs the Entry that sits in this slot (I've not talked specifically about the Entry class, but it's basically a bucket for stuff you want to put into the ring buffer slot which has a sequence number), sets the sequence number on the Entry to the next sequence number (13) and returns this entry to your producer.  The producer can then write whatever value it wants into this Entry.

    Committing the new value
    The second phase of the two-stage commit is, well, the commit.

    The green represents our newly updated Entry with sequence 13 - yeah, I'm sorry, I'm red-green colour-blind too.  But other colours were even more rubbish.

    When the producer has finished writing stuff into the entry it tells the ProducerBarrier to commit it.

    The ProducerBarrier waits for the ring buffer cursor to catch up to where we are (for a single producer this will always be a bit pointless - e.g. we know the cursor is already at 12, nothing else is writing to the ring buffer).  Then the ProducerBarrier updates the ring buffer cursor to the sequence number on the updated Entry - 13 in our case.  Next, the ProducerBarrier lets the consumers know there's something new in the buffer.  It does this by poking the WaitStrategy on the ConsumerBarrier - "Oi, wake up! Something happened!" (note - different WaitStrategy implementations deal with this in different ways, depending upon whether it's blocking or not).

    Now consumer 1 can get entry 13, consumer 2 can get everything up to and including 13, and they all live happily ever after.

    ProducerBarrier batching
    Interestingly the disruptor can batch on the producer side as well as on the Consumer side.  Remember when consumer 2 finally got with the programme and found itself at sequence 9?  There is a very cunning thing the ProducerBarrier can do here - it knows the size of the buffer, and it knows where the slowest Consumer is.  So it can figure out which slots are now available.

    If the ProducerBarrier knows the ring buffer cursor is at 12, and the slowest Consumer is at 9, it can let producers write to slots 3, 4, 5, 6, 7 and 8 before it needs to check where the consumers are.

    Multiple producers
    You thought I was done, but there's more.

    I slightly lied in some of the above drawings.  I implied that the sequence number the ProducerBarrier deals with comes directly from the ring buffer's cursor.  However, if you look at the code you'll see that it uses the ClaimStrategy to get this.  I skipped this to simplify the diagrams, it's not so important in the single-producer case.

    With multiple producers, you need yet another thing tracking a sequence number.  This is the sequence that is available for writing to.  Note that this is not the same as ring-buffer-cursor-plus-one - if you have more than one producer writing to the buffer, it's possible there are entries in the process of being written that haven't been committed yet.

    Let's revisit claiming a slot.  Each producer asks the ClaimStrategy for the next available slot.  Producer 1 gets sequence 13, like in the single producer case above.  Producer 2 gets sequence 14, even though the ring buffer cursor is still only pointing to 12, because the ClaimSequence is dishing out the numbers and has been keeping track of what's been allocated.

    So each producer has its own slot with a shiny new sequence number.

    I'm going to colour producer 1 and its slot in green, and producer 2 and its slot in a suspiciously pink-looking purple.

    Now imagine producer 1 is away with the fairies, and hasn't got around to committing for whatever reason.  Producer 2 is ready to commit, and asks the ProducerBarrier to do so.
    As we saw in the earlier commit diagram, the ProducerBarrier is only going to commit when the ring buffer cursor reaches the slot behind the one it wants to commit into.  In this case, the cursor needs to reach 13 so that we can commit 14.  But we can't, because producer 1 is staring at something shiny and hasn't committed yet.  So the ClaimStrategy sits there spinning until the ring buffer cursor gets to where it should be.

    Now producer 1 wakes up from its coma and asks to commit entry 13 (green arrows are sparked by the request from producer 1).  The ProducerBarrier tells the ClaimStrategy to wait for the ring buffer cursor to get to 12, which it already has, of course.  So the ring buffer cursor is incremented to 13, and the ProducerBarrier pokes the WaitStrategy to let everything know the ring buffer was updated.  Now the ProducerBarrier can finish the request from producer 2, increment the ring buffer cursor to 14, and let everyone know that we're done.

    You'll see that the ring buffer retains the ordering implied by the order of the initial nextEntry() calls, even if the producers finish writing at different times.  It also means that if a producer is causing a pause in writing to the ring buffer, when it unblocks any other pending commits can happen immediately.

    Phew.  And I managed to describe all that without mentioning a memory barrier once.

    EDIT: The most recent version of the RingBuffer hides away the Producer Barrier.  If you can't see a ProducerBarrier in the code you're looking at, then assume where I say "producer barrier" I mean "ring buffer".

    EDIT 2: Note that version 2.0 of the Disruptor uses different names to the ones in this article.  Please see my summary of the changes if you are confused about class names.

    Dissecting the Disruptor: How do I read from the ring buffer?

    The next in the series of understanding the Disruptor pattern developed at LMAX.

    After the last post we all understand ring buffers and how awesome they are.  Unfortunately for you, I have not said anything about how to actually populate them or read from them when you’re using the Disruptor.

    ConsumerBarriers and Consumers
    I’m going to approach this slightly backwards, because it’s probably easier to understand in the long run.  Assuming that some magic has populated it: how do you read something from the ring buffer?

    (OK, I’m starting to regret using Paint/Gimp.  Although it’s an excellent excuse to purchase a graphics tablet if I do continue down this road.  Also UML gurus are probably cursing my name right now.)

    Your Consumer is the thread that wants to get something off the buffer.  It has access to a ConsumerBarrier, which is created by the RingBuffer and interacts with it on behalf of the Consumer.  While the ring buffer obviously needs a sequence number to figure out what the next available slot is, the consumer also needs to know which sequence number it’s up to - each consumer needs to be able to figure out which sequence number it’s expecting to see next.  So in the case above, the consumer has dealt with everything in the ring buffer up to and including 8, so it’s expecting to see 9 next.

    The consumer calls waitFor on the ConsumerBarrier with the sequence number it wants next

        final long availableSeq = consumerBarrier.waitFor(nextSequence);

    and the ConsumerBarrier returns the highest sequence number available in the ring buffer - in the example above, 12.  The ConsumerBarrier has a WaitStrategy which it uses to decide how to wait for this sequence number - I won’t go into details of that right now; the code has comments outlining the advantages and disadvantages of each.

    Now what?
    So the consumer has been hanging around waiting for more stuff to get written to the ring buffer, and it’s been told what has been written - entries 9, 10, 11 and 12.  Now they’re there, the consumer can ask the ConsumerBarrier to fetch them.

    As it’s fetching them, the Consumer is updating its own cursor.
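
    In code, that fetch-and-advance looks roughly like this (MyEntry is just a stand-in for whatever Entry type you’re using):

        while (nextSequence <= availableSeq)
        {
            MyEntry entry = consumerBarrier.getEntry(nextSequence);  // read straight off the ring buffer
            // ... do something useful with the entry ...
            nextSequence++;                                          // the consumer tracks its own progress
        }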

    You should start to get a feel for how this helps to smooth latency spikes - instead of asking “Can I have the next one yet?  How about now?  Now?” for every individual item, the Consumer simply says “Let me know when you’ve got more than this number”, and is told in return how many more entries it can grab.  Because these new entries have definitely been written (the ring buffer’s sequence has been updated), and because the only things trying to get to these entries can only read them and not write to them, this can be done without locks.  Which is nice.  Not only is it safer and easier to code against, it’s much faster not to use a lock.

    And the added bonus - you can have multiple Consumers reading off the same RingBuffer, with no need for locks and no need for additional queues to coordinate between the different threads.  So you can really run your processing in parallel with the Disruptor coordinating the effort.

    The BatchConsumer is an example of consumer code, and if you implement the BatchHandler you can get the BatchConsumer to do the heavy lifting I’ve outlined above.  Then it’s easy to deal with the whole batch of entries processed (e.g. from 9-12 above) without having to fetch each one individually.
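
    A handler might look something like this - just a sketch of the shape, showing the two callbacks used in the loops above:

        // in real code this would implement BatchHandler<MyEntry>, MyEntry being your Entry subclass
        public class MyEntryHandler
        {
            public void onAvailable(MyEntry entry) throws Exception
            {
                // called once for each entry the BatchConsumer fetches - business logic goes here
            }

            public void onEndOfBatch() throws Exception
            {
                // called when the current batch is exhausted - a good place to flush anything you've buffered
            }
        }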

    EDIT: Note that version 2.0 of the Disruptor uses different names to the ones in this article.  Please see my summary of the changes if you are confused about class names.