
This post is missing a key trade-off that you get with an in-memory row store (tl;dr: the ability to build lock-free data structures), and, since it dates to 6/25/2012, it also misses several other innovations in MemSQL that have shipped over the past 2+ years.

The main advantage of a main-memory row store is the ability to leverage random access to data in memory. This frees you to use random access aggressively and to build lock-free indexes (atomic instructions are available only for data that's in memory). With these trade-offs you can build extremely concurrent data structures that simply aren't possible in a system that must manage a buffer cache. See http://blog.memsql.com/the-story-behind-memsqls-skiplist-ind... for more details.
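
(For readers who haven't seen the technique: below is a minimal C++ sketch of the core compare-and-swap idiom that lock-free structures like our skiplist index are built from. It's not our actual code, just the general shape.)

    #include <atomic>

    struct Node {
        int value;
        Node* next;
    };

    std::atomic<Node*> head{nullptr};

    // Lock-free push: a compare-and-swap (CAS) loop retries until no
    // other thread has changed head between our read and our write.
    void push(int value) {
        Node* node = new Node{value, head.load(std::memory_order_relaxed)};
        while (!head.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // On failure, node->next is refreshed with the current
            // head, so the next attempt links against fresh state.
        }
    }

No mutex is ever taken, so readers and writers never block each other.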

The article also suggests that MemSQL is not durable in the event of a power outage. This is simply not the case - we've had durability since the first ever public release of MemSQL, including both asynchronous and synchronous modes.

MemSQL now also has distributed transactions and MPP, supports intra- and inter-data-center replication as well as semi-structured (JSON) data, and even has a column store that works as a hybrid between disk and memory.

And BTW, I actually agree with this mentality in the context of column stores. There's no real advantage to having an "in-memory" column store because of the append-only nature of columnar blobs of data. That's why ours is a hybrid.




As someone who knows nothing about database implementations but a lot about memory, this makes no sense:

> atomic instructions are available only for data that's in memory

All CPU instructions are available only for data that's in memory (or registers). No (sane) database on Earth operates solely by issuing disk I/O.

That linked blog post doesn't support your claim, either. It makes some points about "indirections" and "buffer caches", neither of which is relevant to the applicability of lock-free techniques or atomic instructions.

(Quick proof of both of the above: all virtual memory systems abstract both (a) disk access (swap) and (b) pointer indirection (TLBs), without restricting atomic operations.)

What is true is that atomic operations, and memory barriers for that matter, are not effective across coherency domains. So if you have two processes concurrently accessing a memory-mapped block device from different computers, then yes, you lose atomic operations and lock-free data structures. But, AFAIK, this is not how most databases operate: most run as one or several processes within the same coherency domain (e.g. the same computer). The operating system ensures that multiple processes mapping the same file maintain a coherent view of it (primarily by mapping each disk block to at most one physical memory location).
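
To make the proof concrete, here's a minimal sketch (assuming Linux and C++) of an atomic read-modify-write on file-backed, mmap'd memory. The CPU neither knows nor cares that the page is backed by a file on disk:

    #include <atomic>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        // A file-backed counter: the page lives in a file, but once
        // mapped it is ordinary memory as far as the CPU is concerned.
        int fd = open("/tmp/counter.db", O_RDWR | O_CREAT, 0644);
        ftruncate(fd, sizeof(long));
        void* p = mmap(nullptr, sizeof(long), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        // The atomic increment works on the mapped page exactly as on
        // anonymous memory; other processes mapping the same file see
        // a coherent view. (Casting raw memory to std::atomic is a
        // common systems idiom, used here for brevity.)
        auto* counter = reinterpret_cast<std::atomic<long>*>(p);
        counter->fetch_add(1, std::memory_order_relaxed);

        munmap(p, sizeof(long));
        close(fd);
    }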


Happy to fill in the gaps here. Building a data structure that is lock-free, performant under concurrent reads and writes, and fast at scanning data is very challenging. Skip lists let us implement this in a straightforward way, but they are aggressive with respect to random access. You can indeed leverage swap to extend a lock-free skip list onto disk, but the fact that a skip list links nodes together with pointers does not play well with swap. Swap performance under random access is an extremely well-known problem on Linux.

Disk-based databases reduce their dependence on random access by using data structures like B-trees, which do a much better job (i.e. maintain good scan and seek performance) as they spill on and off the disk. These data structures, however, are not feasible to implement lock-free.
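
To illustrate the random-access point, here's a rough sketch of a skiplist search (not our actual implementation):

    struct SkipNode {
        int key;
        SkipNode* next[4];  // one forward pointer per level (fixed here)
    };

    // Each hop dereferences a pointer to a node allocated anywhere in
    // memory: cheap in RAM, but potentially a page fault per hop once
    // the structure spills into swap. A B-tree instead scans a wide,
    // contiguous node before taking its next (rare) pointer hop.
    SkipNode* find(SkipNode* head, int key, int top_level) {
        SkipNode* node = head;
        for (int level = top_level; level >= 0; --level)
            while (node->next[level] && node->next[level]->key < key)
                node = node->next[level];  // random-access hop
        return node->next[0];              // candidate match at level 0
    }

Every one of those hops is a potential cache miss in RAM and a potential page fault under swap, which is why the structure wants to live entirely in memory.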


Do you think this is a fundamental trade-off, or does it just happen that the structures we know about that are good at lock-free happen to have a lot of random access?

Kind of a loaded question, because obviously new research might reveal new algorithms. I'm just trying to get an idea whether you think there is a fundamental connection between lock-free and random-access structures.


So it's not a disk vs. memory thing, it's a B-tree vs. skiplist thing. That makes a lot more sense.


I'm pretty sure the two are related, no? A B-tree is a disk-friendly structure (shallow and wide); skiplists aren't (but they are a concurrency-friendly structure).


Right. My point is that "you can't use atomic instructions with disk-backed databases" is oversimplified to the point of being incorrect: it's wholly feasible for, say, Postgres to implement a skiplist-based index (or some other lock-free data structure) while still backing its store with the disk, with the only downside being a performance hit if working memory is not large enough.


How exactly have you solved the problem of "durability", given that any disk is orders of magnitude slower to access than RAM? The Mongo way - push buffers to the OS kernel to be written and forget?

Let's put it straight. Words should mean what they originally meant, not what sales or PR people want them to mean.

For us, a database is a piece of software that enforces data consistency and is fail-safe. That's why we want it: to reliably store our data.

According to the laws of physics, throughput is bounded by the slowest operation (the same way the strength of a chain is bounded by its weakest link), and in the case of a database that operation is an atomic update in persistent, fail-safe storage with a consistency guarantee. That is what we call a transaction. This has been the meaning since System R and the first version of Berkeley DB.

Transaction "in memory" is a nonsense for us, so are any memory-to-network-stack or memory-to-client-process "benchmarks".

As long as there is no proven (not merely advertised) consistency guarantee and fail-safety guarantee, such a system should be called a cache, not a database.

Call your product a cache, and all the confusion would disappear. In other words, "in-memory" is not a database; it contradicts the meaning of the word, no matter how your sales people try to twist the language.

There is no "specialized databases". There are specialized caches. OK.


Do you mean forcing flushes of mmap'd regions to disk via an fsync call? The kernel isn't really hiding anything from you there, but if you want to go to that level, then what about disk caches too? Do you trust consensus protocols, or is replication also not durable enough?
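
For concreteness, the flush being discussed is roughly this (a sketch, not any particular database's code):

    #include <cstddef>
    #include <sys/mman.h>

    // Block until the dirty pages in [addr, addr + len) have been
    // written back to the underlying file. Note this drains the OS
    // page cache, not necessarily the disk's own volatile write cache.
    bool flush_region(void* addr, std::size_t len) {
        return msync(addr, len, MS_SYNC) == 0;
    }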


I think he's making a fair point. If the result of a write is that data is stored only in memory on a single machine, and is not written to a more-durable place as a blocking part of the write operation, then I would consider this data store a cache. Because at the point when my application returns from the operation to store the data, the data can still be lost due to a number of common failure scenarios.

If the data is replicated across a number of machines, then determining durability is a more complex task. If sending data to the cluster blocks the user's write on a response from a quorum of the cluster, then I think the cluster can be considered a reliable data store, even if the cluster nodes don't necessarily write to disk immediately or at all. Perhaps they keep the data in memory forever, and there are enough replicas that the chance of data loss (through all replicas being down at once) is low enough to accept. (It's conceptually similar to replicating across many disks and losing all the disks at once, although I agree it seems intuitively a lot more fragile than disk-based replication.)

A more realistic model might be in-memory replication by quorum followed by buffered writes to disk on all replicas too. Since there are many replicas, it's not unreasonable to "persist" data in RAM on the many replicas, while still getting the full benefits of disk persistence after a short time.
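
A sketch of the quorum rule I have in mind, with Replica::send() as a purely hypothetical stand-in for whatever RPC layer the store uses:

    #include <cstddef>
    #include <future>
    #include <string>
    #include <vector>

    // Hypothetical replica handle; send() returns true once the
    // replica has the write in memory (not necessarily on disk).
    struct Replica {
        bool send(const std::string& data);
    };

    // The client's write returns only after a majority of replicas
    // acknowledge; durability comes from the odds of a majority
    // failing at once, not from any single disk. (A real system would
    // return as soon as the majority arrives rather than waiting for
    // every ack, as done here for simplicity.)
    bool quorum_write(std::vector<Replica>& replicas, const std::string& data) {
        std::vector<std::future<bool>> acks;
        for (auto& r : replicas)
            acks.push_back(std::async(std::launch::async,
                                      [&r, &data] { return r.send(data); }));
        std::size_t ok = 0;
        for (auto& a : acks)
            if (a.get()) ++ok;
        return ok > replicas.size() / 2;
    }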

Just thinking out loud here.


You're right that MemSQL absolutely crawls when set to "fully-durable" mode, where it has to write synchronously to disk. But it was never built for that use-case, and I guess complaining about it misses the point.

MemSQL, AFAIK, was built for highly concurrent reads and writes. It uses lock-free in-memory data structures, couples that with an eventually-durable mode, and optimizes queries by re-compiling SQL to C++ (something similar to what HHVM does for PHP). I think this results in performance gains in certain (though not all) use-cases.
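
To illustrate the code-generation point, here's a guess at the flavor of the output; this is not MemSQL's actual generated code:

    // Hypothetical output for: SELECT COUNT(*) FROM t WHERE a > 10
    // The generic expression-tree interpretation is gone; the
    // predicate is inlined and the compiler is free to unroll or
    // vectorize the loop.
    long compiled_count(const int* column_a, long row_count) {
        long count = 0;
        for (long i = 0; i < row_count; ++i)
            if (column_a[i] > 10) ++count;
        return count;
    }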

MemSQL, if we dare simplify, is really a high-throughput cache front-ending an eventually-durable datastore.

I wonder how RocksDB + a disk-backed DB would measure up against MemSQL.

Ref: http://dom.as/2012/06/26/memsql-rage/


Excuse me, does eventually-durable mean sometimes yes, sometimes no?

To put it another way: would any bank accept eventually-durable as its financial transaction storage?

More seriously, without error propagation to the client, so that it knows immediately which of its transactions failed, there is no durability at all.

Ask all the telecom guys: without immediately returned ACK packets and timeouts, a protocol cannot be called "reliable".


I don't work for MemSQL. I know what I know from various posts on Quora and from the MemSQL engineering blog itself, so take everything I say with a grain of salt.

MemSQL has two durability modes (if I'm right): synchronous and asynchronous. I believe that if you're a bank, you'd want the first one. But if high throughput is your priority and you can afford to sacrifice some durability for it, I think MemSQL would be an extremely fast solution compared with other disk-first datastores.

One thing that memory-first datastores (MemSQL, VoltDB) would be really good at is lots of reads.


I'd appreciate it if you could write a new blog post with details about the experiment, so the community can verify the claims and reproduce it with other DBMS solutions.

Also, while I understand the statement about "lock-free", it seems to suggest the code will be as fast as plain main-memory access, which it most certainly will not be (due to cache coherency protocols kicking in during compare-and-swap, locked add, etc.). Your comment doesn't say this, but I am noting it here for others.
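
A tiny sketch of that coherency cost: the increment below is lock-free, yet the cache line holding the counter must bounce between cores on every operation, so it runs far slower than unshared access would.

    #include <atomic>
    #include <thread>
    #include <vector>

    std::atomic<long> shared_counter{0};

    // Every fetch_add must take exclusive ownership of the cache line
    // holding shared_counter, so with several threads the line
    // ping-pongs between cores: "lock-free" does not mean "free".
    int main() {
        std::vector<std::thread> workers;
        for (int t = 0; t < 4; ++t)
            workers.emplace_back([] {
                for (int i = 0; i < 1000000; ++i)
                    shared_counter.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& w : workers) w.join();
        return shared_counter.load() == 4000000 ? 0 : 1;
    }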





