Hacker News

    «An asynchronous, event-driven architecture based on 
     highly optimized coroutine code scales across multiple
     cores and processors, network cards, and storage systems.»
It may be a dumb question, but isn't this statement a bit contradictory? As far as I understand, event-driven design and coroutines (i.e. cooperative multitasking, lightweight threads, etc.) are the techniques usually chosen to AVOID concurrency.

How does such a design imply multicore scalability? Obviously, coroutines and event loops don't prevent you from running in multiple cores. I just fail to see the correlation.




This is a great question. We start a thread per core, and multiplex thousands of coroutines/events on each thread. When coroutines on different threads need to communicate, we send a message via a highly optimized message bus, so cross-thread communication code is localized. This means each thread is lock-free (i.e. when a coroutine needs to communicate with another coroutine, it sends a message and yields, so the CPU core can process other pending tasks). The code isn't wait-free -- a coroutine might have to wait, but it never ever locks the CPU core itself. So, as long as there is more work to do, the CPU will always be able to do it.

If instead we used threads + locking like traditional systems, we'd have to deal with "hot locks" that block out entire cores. Effectively we solved this problem once and for all, while systems that use threads + locks (like the Linux kernel) have to continuously solve it by making sure locks are extremely granular.


Sounds very Erlang-ish. Did you copy that deliberately?


We effectively wrote an ad-hoc mini Erlang runtime at the core of the system. I'm not sure how deliberate that was -- we borrowed performance ideas from many places, tried a lot of different approaches, and settled on this one. Lots of it was definitely inspired by ideas from Erlang.


There definitely seems to be a version of Greenspun's Tenth Rule for Erlang. But I think Greenspunning has gotten too bad a name – sometimes implementing a subset of a classical system is exactly what you ought to do, for example when your problem allows you to exploit certain invariants that don't hold in the general case, or for some reason using the classical system itself (Erlang in this case) is not an option.


Right! Rethink has an ad-hoc Erlang runtime for message processing, and an ad-hoc Lisp for the query language. I'm both ashamed and proud of this :)


You get less overhead by using the low-level concurrency primitives that involve cross-core synchronization less often. In RethinkDB, cross-core synchronization happens mainly when an on_thread_t object is constructed or destroyed (and also in a few other places), and those operations get batched when there is more than one per event loop (which is not necessarily good -- inflated queue sizes are also something to be wary of). So if you want to attach a bunch of network cards and high-speed storage devices on opposite ends of a handful of CPUs, your throughput won't be hindered by the fact that millions of lightweight threads are trying to talk to one another.



