See also Aeron[1] which is basically a networked superset of the Disruptor, featuring (IIRC) a just-as-fast machine-local implementation. Made by some of the same people as the LMAX Disruptor. It's not quite ready for general use AFAICT, but it's quickly getting there...
There's also a really informative presentation[2] by Martin Thompson on some of the ideas that they're using to get really low latency and high throughput.
I really don't want to put this down. This looks very interesting and is nicely presented. But I have trouble recognizing the main idea(s) behind the "LMAX Disruptor". To me, this all boils down to:
"You get better performance if you implement your event queue via a pre-allocated ring buffer instead of (dynamically allocated) linked-lists or arrays."
It's been many years since I saw the presentation, but I believe they made a point of mentioning that ring buffers are not a new technology. They mentioned how such data structures are used all the time in network devices, such as switches.
The disruptor is an actual library based on ring buffers (others have already pointed out some other features).
In practical terms, this library has been immensely popular. Within a couple years of disruptor's release, practically every (java based) trading firm was using them. I've even seen c++ implementations, only vaguely resembling this library, being referred to as 'disruptors.'
Beyond the library, the original (and subsequent) presentations by the authors popularized the concepts of "mechanical sympathy" (understand the abstraction layer beneath what you are currently working in) and introduced a whole new set of developers to modern CPUs, thinking about cache lines, etc.
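To make the cache-line point concrete, here's the classic padding trick those talks walked through (a sketch of the general idea, not the library's actual code; the 64-byte line size is an assumption, and some JVMs can eliminate apparently-unused fields, which is why later approaches moved to sturdier schemes like the @Contended annotation):

```java
// A hot counter padded so it occupies its own cache line and is not
// "falsely shared" with a neighbouring field that another thread writes.
// Assumes 64-byte cache lines; field names are illustrative.
class PaddedCounter {
    long p1, p2, p3, p4, p5, p6, p7; // padding before (7 * 8 = 56 bytes)
    volatile long value;             // the actual hot field
    long q1, q2, q3, q4, q5, q6, q7; // padding after
}
```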
What you are missing is that with the disruptor pattern you can do this lock-less while keeping the order: The producer gets a sequence number before accessing the space in the queue. This seqno guarantees the order of consumption so there will be no reordering of events on the consumer side. After creating the message, the producer commits this id. Meanwhile the consumer waits for the next seqno to become committed and consumes it.
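The claim-then-commit sequence described above can be sketched in Java for the single-producer case (class and method names are mine, not the Disruptor's API; wrap-around protection against overwriting slots the consumer hasn't read yet is omitted for brevity):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer/single-consumer sketch of the seqno protocol:
// the producer claims a sequence, writes the slot, then commits the seqno;
// the consumer waits for its next seqno to be committed before reading.
class SeqRing {
    final Object[] slots;
    final int mask;                                  // capacity is a power of two
    final AtomicLong committed = new AtomicLong(-1); // last published seqno
    long next = 0;                                   // producer-local claim counter

    SeqRing(int capacityPow2) {
        slots = new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Producer: claim a sequence, write the entry, then commit it.
    long publish(Object msg) {
        long seq = next++;                // claim (single producer: no CAS needed)
        slots[(int) (seq & mask)] = msg;  // write into the pre-allocated slot
        committed.set(seq);               // commit: makes the entry visible
        return seq;
    }

    // Consumer: wait until the wanted sequence is committed, then read it.
    Object take(long seq) {
        while (committed.get() < seq) Thread.onSpinWait();
        return slots[(int) (seq & mask)];
    }
}
```

Because sequence numbers are handed out in order and committed in order, the consumer sees events in exactly the order they were claimed, with no lock on the hot path.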
I have implemented a multiple producer, single consumer variant in C for logging and it works quite well. Loggers can write their message concurrently to the ring buffer, but the order will be correct in the log file.
IIRC, the data structure that they used was not thread-safe. There eventually had to be an interlocked compare exchange operation which slowed throughput. Also, I couldn't imagine this working correctly in out-of-order cores without memory barriers at the very least.
Yes, you need to use CAS and barriers to make it thread safe. I have also used a condvar for the consumer to avoid spinning while waiting for log messages.
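A hedged sketch of that multi-producer variant in Java (illustrative names, not anyone's actual logger): producers race on a CAS to claim distinct sequence numbers, then commit strictly in sequence order so the single consumer still sees a totally ordered stream. A real logger would park the consumer on a condition variable rather than spin, as mentioned above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Multiple producers, single consumer. CAS hands out unique seqnos;
// each producer waits for its predecessor to commit before committing,
// which keeps the consumer-side ordering correct without locks.
class MpscRing {
    final String[] slots = new String[1024];         // capacity 1024, power of two
    final AtomicLong claim = new AtomicLong(0);      // next seqno to hand out
    final AtomicLong committed = new AtomicLong(-1); // highest committed seqno

    void log(String msg) {
        long seq;
        do { seq = claim.get(); }                    // CAS loop: claim a unique slot
        while (!claim.compareAndSet(seq, seq + 1));
        slots[(int) (seq & 1023)] = msg;             // write the message
        while (committed.get() != seq - 1)           // commit strictly in order
            Thread.onSpinWait();
        committed.set(seq);
    }

    String read(long seq) {                          // single consumer
        while (committed.get() < seq) Thread.onSpinWait();
        return slots[(int) (seq & 1023)];
    }
}
```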
The Disruptor is an example of solid engineering - the fact that there's nothing "new" is a point in its favor. Its creators acknowledge that similar approaches are common in network switches, etc. I would say these techniques aren't well known among Java programmers. Not new != widely known.
Mostly yes. Perhaps the thing to note is that the Disruptor moves the logic for manipulating the indices outside of the ring buffer itself, allowing multiple independent consumers to exist.
The idea behind the Disruptor is to move data between threads in a contention-free manner.
ArrayBlockingQueue [1] uses a ring buffer backed by a fixed-size array, too. It is really hard to say where the speed-up comes from without seeing the code, but it is certainly not from using a ring buffer, because both of the compared implementations do so. I guess they were just more careful when they designed the auxiliary state and the operations on it. For example, ArrayBlockingQueue explicitly tracks the number of elements in the queue with a separate field - instead of deriving it from the insert and remove pointers - and therefore this field is contended by readers and writers. Also, preallocating entries for all slots to work around Java's lack of support for arrays of value types certainly removes some overhead from the allocator.
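The contended-field point can be illustrated with a toy single-producer/single-consumer queue (a sketch of the design idea, not ArrayBlockingQueue's or the Disruptor's code): each counter is written by only one side, and occupancy is derived rather than stored in a field both sides mutate.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy SPSC ring: `tail` is written only by the producer, `head` only by
// the consumer, so no single field is write-contended by both sides.
// Size is derived from the two counters instead of kept in a shared field.
class DerivedSizeQueue {
    final Object[] buf = new Object[16];       // capacity 16, power of two
    final AtomicLong tail = new AtomicLong(0); // insert pointer (producer-only)
    final AtomicLong head = new AtomicLong(0); // remove pointer (consumer-only)

    boolean offer(Object e) {
        long t = tail.get();
        if (t - head.get() == buf.length) return false; // full
        buf[(int) (t & 15)] = e;
        tail.set(t + 1);
        return true;
    }

    Object poll() {
        long h = head.get();
        if (h == tail.get()) return null;               // empty
        Object e = buf[(int) (h & 15)];
        head.set(h + 1);
        return e;
    }

    long size() { return tail.get() - head.get(); }     // derived, never stored
}
```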
EDIT: Just realized that the code is on GitHub - will have a look.
That is why I mentioned it in the first place. The paper reads very much like they think using a ring buffer provides a performance advantage over other queue implementations, but then they compare two implementations that both use ring buffers. So their improved performance definitely does not stem from using a ring buffer but from implementation details like the synchronization mechanisms used or preallocating and reusing queue entries.
It's been a while, but didn't the performance turn out to be similar to other data structures in multi-socket scenarios because it requires interlocked -type instructions?
[1] https://github.com/real-logic/Aeron
[2] https://www.infoq.com/presentations/aeron-messaging