Re: MemSQL the “world's fastest database”? (2012) (postgresql.org)
138 points by Cieplak on Oct 2, 2014 | 45 comments



A well-designed disk-oriented database engine operating in cache will often be as fast as an in-memory database engine.

I once designed an in-memory database kernel for a supercomputer. One of the most fun designs I ever did. However, I also learned a lot about the realities and nuances of in-memory designs that are magnified on a supercomputer but still have large impacts on modern server hardware.

The biggest insight was this: memory behaves like complex block storage, with everything implied with respect to optimization, if you really care about performance. You can easily lose 10x the theoretical throughput by treating in-memory structures as random access storage.

In disk-backed systems, we tend to ignore this aspect of memory because, relative to disks, memory really is uniformly random-access to a first approximation. But disk-based systems are inherently optimized around multi-tier block access patterns. In a database kernel without disks, treating memory access appropriately (i.e. not random access) becomes a major point of optimization with huge gains if you do it right. And even in systems with highly optimized disk I/O subsystems (i.e. no POSIX I/O calls or kernel caching), treating memory as random access starts to put a serious drag on overall system performance.

In-memory databases let you avoid dealing with disk I/O, which gives a major boost in performance. However, if performance matters, that still does not let you treat memory as a random access storage device. Understanding how to design algorithms and data structures that optimize memory I/O patterns often has as much performance uplift relative to naive random access assumptions as there is when going from disk I/O to in-memory. Most in-memory systems never take the second step.


> The biggest insight was this: memory behaves like complex block storage, with everything implied with respect to optimization, if you really care about performance.

Could you describe the factors causing that? Is it just the caching hierarchy or are there other effects?


The biggest effects are varying page sizes as bits move through silicon and the latency of memory access, which reflects cache hierarchies but also NUMA, DIMMs, and other considerations with respect to how memory is connected to the system and the topology of how memory moves into the local CPU cache. Due to the physical size of supercomputers, the latency variance for "flat" cache coherent memory access can become so bad that it has pathological characteristics on algorithms that work adequately with lower latencies. With how fast and efficient CPUs are today, this has a huge effect even on single system boards. For example, in databases it is not uncommon to lock a core to physical memory that is directly attached to that CPU's memory channels instead of some other CPU's memory channels.

Some performance-sensitive open source libraries do varying amounts of memory topology aware optimization. The rudimentary tricks can often be worth a 2x performance boost for in-memory processing.
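
To make the core-pinning trick above concrete, here is a minimal sketch in C++ using Linux's libnuma (the node number, the 1 GiB size, and the near-total lack of error handling are illustrative only, not taken from any particular database):

    // Minimal sketch (assumes Linux + libnuma, link with -lnuma).
    // Pin the calling thread to NUMA node 0 and allocate its working
    // set from node 0's local memory, so accesses stay on-socket.
    #include <numa.h>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "NUMA not available\n");
            return 1;
        }
        // Run this thread only on CPUs belonging to node 0.
        numa_run_on_node(0);

        // Allocate 1 GiB backed by physical pages on node 0.
        const size_t bytes = 1UL << 30;
        void* buf = numa_alloc_onnode(bytes, 0);
        if (!buf) return 1;

        // ... build this core's partition of the index/table in buf ...

        numa_free(buf, bytes);
        return 0;
    }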

In the same way that really clever schedulers and disk I/O systems (they don't let the OS do it) in good databases try to optimize the average access time and variance, high-performance in-memory systems use very similar techniques to optimize the average access time to any particular piece of memory. The techniques are very similar in the abstract but there is a lot less written about the nuances of in-memory scheduling; it often involves understanding lower level details of the silicon than programmers are used to studying. The net effect is a large reduction in busy stalls and contention.

Also, people tend to forget that even though lock-free algorithms are fast on average, they tend to have a lot of overhead in systems with many cores unless they are used sparingly and accessed relatively infrequently. Eliminating that waste can have large performance benefits.

To your question: In order to minimize the need for thread coordination and optimize memory scheduling, the first thing you need to do is design algorithms and data structures that approximately match the topology of your storage, whether memory or disk or both. And that rarely looks like idealized random access memory. Then you optimize for locality of execution. I've designed database kernels that had thousands of parallel threads running. It turns out that the techniques used to eliminate almost all coordination in that case produce extremely efficient database kernels on a commodity server with vastly smaller resources.

Understanding the topology of real silicon systems is a large part of "optimization".


That makes a lot of sense. Regardless of where your data is, you want to best map your accesses to the underlying reality of whatever hardware is in use. And memory has significant locality on any even moderately big system.

I do think this glosses over the difference in data structures that someone focusing on efficiently storing and reading the data to and from disk might use versus the data structures that someone focused mainly on efficiently representing the data in memory might use, ignoring durability. Queries are one part of this, but dealing with updates and indexing can also be quite important.

I don't know if that difference is fundamental to the design of a database or just more of a lingering consequence of many databases being designed around most data being on disk, but it is what I observe looking at a variety of current database solutions.

Getting back to the main article here, I've been doing some basic testing against MemSQL today and, "world's fastest" aside, I like a lot of what I see, other than painfully long query parse/compile times. It does, however, appear to be true for my queries that most of the performance benefits are due to the distributed query engine and not due to any fundamental data structure differences compared to something like postgres or mysql/innodb. But my queries are very anti-index.

SpaceCurve also sounds interesting, hopefully we can firm up our use cases enough and get far enough along in a technical evaluation that I can find time to play with it.


For complex optimization problems like this, why do I never hear about the significant research on automatic optimization being applied? I would expect each architecture to have particular optimal strategies of memory usage that could be discovered with the machine learning tools used to solve other sorts of optimization problems. I take it from your posts that there isn't a tool you can load up, run for a few days, then export the most successful strategies it found and use them as parameters obeyed by the system being implemented? Is there a particular reason why? Is the complexity far greater than I'm presuming?


Thanks for that. I hadn't realised that most (large) commodity systems these days have a NUMA setup. I thought it was relegated to the more esoteric end of things (supercomputer setups, single-system-image clusters).

Thanks - I can see that blindly treating NUMA memory as "uniform random access" is going to hurt.


Wow, dude, your post made me feel VERY incompetent. This happens a lot on HN.


This post is missing a key trade-off that you get with an in-memory row store (tl;dr the ability to build lock-free), and also, since 6/25/2012, several other innovations that have shipped in MemSQL over the past 2+ years.

The main advantage of having a main-memory row store is the ability to leverage random-access data in memory. This frees you to use random access aggressively and build lock-free indexes (atomic instructions are available only for data that's in memory). With these tradeoffs you can build extremely concurrent data structures that simply aren't possible in a system that must manage a buffer cache. See http://blog.memsql.com/the-story-behind-memsqls-skiplist-ind... for more details.
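
A minimal sketch of the kind of structure being described (not MemSQL's actual code): a lock-free insert into a sorted linked list, essentially the bottom level of a skip list, using compare-and-swap on in-memory pointers. It ignores deletion and memory reclamation entirely.

    // Sketch of a lock-free sorted-list insert (the bottom level of a
    // skip list), using CAS on in-memory pointers. Insert-only; a real
    // index also needs deletion and safe memory reclamation.
    #include <atomic>
    #include <cstdint>

    struct Node {
        int64_t key;
        std::atomic<Node*> next;
        Node(int64_t k, Node* n) : key(k), next(n) {}
    };

    // head is a sentinel; head->next points at the first element.
    bool insert(Node* head, int64_t key) {
        Node* node = new Node(key, nullptr);
        while (true) {
            Node* prev = head;
            Node* curr = prev->next.load(std::memory_order_acquire);
            // Find the insertion point.
            while (curr && curr->key < key) {
                prev = curr;
                curr = curr->next.load(std::memory_order_acquire);
            }
            if (curr && curr->key == key) { delete node; return false; }  // duplicate
            node->next.store(curr, std::memory_order_relaxed);
            // Publish the new node; if another thread raced us, retry from the head.
            if (prev->next.compare_exchange_weak(curr, node,
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed))
                return true;
        }
    }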

The article also suggests that MemSQL is not durable in the event of a power outage. This is simply not the case - we've had durability since the first ever public release of MemSQL, including both asynchronous and synchronous modes.

MemSQL now also has distributed transactions and MPP, supports intra and inter data-center replication, semi-structured (JSON) data, and even has a column store that works as a hybrid between disk and in-memory.

And BTW, I actually agree with this mentality in the context of column stores. There's no real advantage to having an "in-memory" column store because of the append-only nature of columnar blobs of data. That's why ours is a hybrid.


As someone who knows nothing about database implementations but a lot about memory, this makes no sense:

> atomic instructions are available only for data that's in memory

All CPU instructions are available only for data that's in memory (or registers). No (sane) database on Earth operates solely by issuing disk I/O.

That linked blog post doesn't support your claim, either. It makes some points about "indirections" and "buffer caches", neither of which is relevant to the applicability of lock-free techniques or atomic instructions.

(Quick proof of both of the above: all virtual memory systems abstract both (a) disk access (swap) and (b) pointer indirection (TLBs), without restricting atomic operations.)

What is true, is that atomic operations, and memory barriers for that matter, are not effective across coherency domains. So, if you have two processes concurrently accessing a memory-mapped block device on different computers, then yes, you lose atomic operations and lock-free data structures. But, AFAIK, this is not how most databases operate: most run as one or several processes within the same coherency domain (e.g. the same computer). The operating system ensures that multiple processes mapping the same file maintain a coherent view of it (primarily by mapping each disk block to at most one physical memory location).
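
A minimal sketch of that point: within a single machine, atomic instructions work fine on memory that is backed by a file on disk. (The path and the single-counter layout are placeholders; this also hand-waves past the formal requirement to construct the std::atomic in place, which works in practice for a zero-initialized counter on mainstream platforms.)

    // Sketch: atomic fetch-add on a counter that lives in a disk-backed,
    // memory-mapped file. Processes on the same machine mapping this file
    // share one coherent view, so atomics behave as usual.
    #include <atomic>
    #include <cstdint>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        int fd = open("/tmp/counter.db", O_RDWR | O_CREAT, 0644);  // placeholder path
        if (fd < 0) return 1;
        if (ftruncate(fd, sizeof(std::atomic<uint64_t>)) != 0) return 1;

        void* p = mmap(nullptr, sizeof(std::atomic<uint64_t>),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        // Treat the zero-initialized bytes as an atomic counter.
        auto* counter = reinterpret_cast<std::atomic<uint64_t>*>(p);
        counter->fetch_add(1, std::memory_order_relaxed);  // lock-free, disk-backed

        munmap(p, sizeof(std::atomic<uint64_t>));
        close(fd);
        return 0;
    }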


Happy to fill in the gaps here. Building a data structure that is lock free, performant at concurrent reads and writes, and fast at scanning data is very challenging. Skip lists enable us to implement this in a straightforward way, but are aggressive with respect to random access. You can indeed leverage swap to extend a lock free skip list onto the disk, but the fact that a skip list links together nodes with pointers does not play very well with swap. Swap performance for random access is an extremely well known problem in Linux.

Disk-based databases reduce their dependence on random access by using data structures like B-Trees, which do a much better job (i.e. maintain good scan and seek performance) as they spill on and off the disk. These data structures, however, are not feasible to implement lock free.


Do you think this is a fundamental trade-off, or does it just happen that the structures we know about that are good at lock-free happen to have a lot of random access?

Kind of a loaded question, because obviously new research might reveal new algorithms. I'm just trying to get an idea whether you think there is a fundamental connection between lock-free and random-access structures.


So it's not a disk vs. memory thing, it's a B-tree vs. skiplist thing. That makes a lot more sense.


I'm pretty sure the two are related, no? A B-tree is a disk-friendly structure (shallow and wide); skip lists aren't (but they are a concurrency-friendly structure).


Right. My point is that "you can't use atomic instructions with disk-backed databases" is oversimplified to the point of being incorrect: it's wholly feasible for, say, Postgres to implement a skiplist-based index (or some other lock-free data structure) while still backing its store with the disk, with the only downside being a performance hit if working memory is not large enough.


How exactly have you solved the problem of "durability", given that any disk is orders of magnitude slower to access than RAM? The Mongo way - push buffers to be written to the OS kernel and forget?

Let's put it straight. Words should mean what they originally meant, not what sales or PR guys want them to mean.

For us, a database is a piece of software which enforces data consistency and is fail-safe. That's why we want it - to reliably store our data.

According to the laws of physics, the throughput is bounded by the slowest operation (the same way the strength of a chain is bounded by its weakest link), and in the case of a database, this operation is an atomic update in a persistent, failsafe storage with a consistency guarantee. That is what we call a transaction. This has been the meaning since System R or the first version of Berkeley DB.

Transaction "in memory" is a nonsense for us, so are any memory-to-network-stack or memory-to-client-process "benchmarks".

As long as there is no proven (not advertised) consistency guarantee and fail-safety guarantee, such a system should be called a Cache, not a Database.

Call your product a Cache, and all the confusion would disappear. In other words, "in-memory" is not a database; it contradicts the meaning of the word, no matter how your sales people try to twist the language.

There is no "specialized databases". There are specialized caches. OK.


Do you mean forcing flushes of mmaped regions to disk via an fsync call? The kernel isn't really hiding anything from you there, but if you want to go to that level, then what about disk caches too? Do you trust consensus protocols or is replication also not durable enough?
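
For reference, a minimal sketch of what "forcing flushes" means here: writing into an mmap'd region does not by itself make the data durable; the write is only safe to acknowledge after an msync (or write()+fsync) has returned. The function and parameter names below are made up for illustration.

    // Sketch: making a write durable before acknowledging it.
    // msync(MS_SYNC) blocks until the touched pages reach the device;
    // fsync also flushes file metadata.
    #include <cstring>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    bool durable_append(int fd, void* mapped, size_t offset,
                        const char* record, size_t len) {
        std::memcpy(static_cast<char*>(mapped) + offset, record, len);
        // Flush just the touched, page-aligned range and wait for completion.
        size_t page  = static_cast<size_t>(sysconf(_SC_PAGESIZE));
        size_t start = (offset / page) * page;
        size_t span  = (offset + len) - start;
        if (msync(static_cast<char*>(mapped) + start, span, MS_SYNC) != 0) return false;
        return fsync(fd) == 0;  // also flush file metadata
    }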


I think he's making a fair point. If the result of a write is that data is stored only in memory on a single machine, and is not written to a more-durable place as a blocking part of the write operation, then I would consider this data store a cache. Because at the point when my application returns from the operation to store the data, the data can still be lost due to a number of common failure scenarios.

If the data is replicated across a number of machines, then determining the durability is a more complex task. If sending data to the cluster blocks the user's write on a response from a quorum of the cluster, then I think the cluster can be considered a reliable data store, even if the cluster nodes don't necessarily write to disk immediately or at all. Perhaps they keep the data in memory forever, and there are enough replicas that the chance of data loss (through all replicas being down at once) is low enough to accept. (It's conceptually similar to replicating across many disks and losing all the disks at once, although I agree, it seems intuitively a lot more fragile than disk-based replication.)

A more realistic model might be in-memory replication by quorum followed by buffered writes to disk on all replicas too. Since there are many replicas, it's not unreasonable to "persist" data in RAM on the many replicas, but we also get the full benefits of disk persistence after a short time.

Just thinking out loud here.


You're right that memSQL absolutely crawls when it is set to "fully-durable" mode where it has to sync writes to the disk. But it was never built for that use-case, and I guess that's missing the point.

memSQL, AFAIK, was built for highly concurrent reads and writes. It uses lock-free in-memory data structures; couple that with an eventually-durable mode, plus optimizing queries by recompiling SQL to C++ (something similar to what HHVM does for PHP), and I think it results in performance gains in certain (though not all) use-cases.
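
As a purely hypothetical illustration of what compiling SQL down to C++ buys: instead of interpreting an expression tree per row, the generated code is a tight loop the compiler can optimize, compiled once and reused. The schema and function below are invented, not MemSQL's actual codegen output.

    // Hypothetical codegen output for:
    //   SELECT count(*) FROM orders WHERE amount > ? AND region = ?;
    // A real system would emit something like this once, compile it, cache
    // the shared object, and reuse it for every query of the same shape.
    #include <cstdint>
    #include <vector>

    struct OrderRow { int64_t amount; int32_t region; };

    int64_t compiled_query_42(const std::vector<OrderRow>& orders,
                              int64_t amount_param, int32_t region_param) {
        int64_t count = 0;
        for (const OrderRow& row : orders) {   // tight, branch-predictable loop
            if (row.amount > amount_param && row.region == region_param)
                ++count;
        }
        return count;
    }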

memSQL, if we dare simplify, is really a high-throughput cache front-ending an eventually-durable datastore.

I wonder how RocksDB + a disk-backed DB would measure up to memSQL.

Ref: http://dom.as/2012/06/26/memsql-rage/


Excuse me, does eventually-durable mean sometimes yes, sometimes no?

To put it another way - would any bank accept eventually-durable as its financial transaction storage?

More seriously, without error propagation to the client, such that it can "know immediately" which of its transactions failed, there is no durability at all.

Ask all these telecom guys - without immediately returned ACK packets and timeouts, a protocol cannot be called "reliable".


I don't work for MemSQL. I know what I know from various posts on Quora, and from the MemSQL engineering blog itself. So take everything I say with a grain of salt.

MemSQL has two durability modes (if I am right): synchronous and asynchronous. I believe, if you're a bank, you'd want the first one. But if high throughput is your priority and you can afford to sacrifice durability for it, I think MemSQL would be an extremely fast solution compared with other disk-first datastores.

One thing that memory-first datastores (MemSQL, VoltDB) would be really good at is lots of reads.


I'd appreciate it if you could write a new blog post with details about the experiment so the community can verify the claims / reproduce it with other DBMS solutions.

Also, while I understand the statement about "lock-free", it seems to suggest the code will be as fast as main memory access, which it most certainly will not (due to cache coherency protocols kicking in while doing cswap, locked-add, etc.) Your comment doesn't say this, but I am noting it here for others.


That message is from 2012.


Thanks. Added.


These things are like a con-trick on naive developers.

Do they really think that the brains behind the big iron DBs hadn't considered this stuff?

Our transactions don't do _any_ fsyncs, with a small risk of losing 4k of data that's recoverable from application logs in any case. As the post says, you still need sort algorithms, pages &c &c.


He mentions RethinkDB as an in-memory DB, but..it's not.


RethinkDB has a long history ... someone better informed can shed some light on this, but when this was written, it was being specifically designed to be used with SSDs. Clearly, that's still not "in-memory", but it was probably reasonable to point out that it was designed with a different set of constraints.


I'm one of the founders of RethinkDB. At no point did we ever work on an in-memory database product.

RethinkDB is immediately consistent by default -- in our earlier days we did spend a lot of time optimizing our serializer for solid-state drives. However, we never designed it with "the tenet that [the database] was 100% in memory."


Curious what kind of optimizations you did for SSDs.


Here's a few links (from previous HN threads and blog posts) that dive into some of the optimizations we made around SSDs:

* https://news.ycombinator.com/item?id=4795443

* http://www.rethinkdb.com/blog/on-trim-ncq-and-write-amplific...


The trick involves writing sectors (of size 4K, say, and it can depend on the SSD model) sequentially in aligned blocks (of size 1M, say, but it can depend on the SSD model) so that the SSD firmware that can erase one physical block at a time can handle it efficiently. Especially in 2010.
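
A minimal sketch of that write pattern on Linux using O_DIRECT (the 4K alignment, the 1M block size, and the file path are just the example values above, not any particular product's layout):

    // Sketch: append 1 MiB aligned blocks sequentially with O_DIRECT so the
    // SSD's flash translation layer sees large writes it can map onto whole
    // erase blocks. Compile with g++ on Linux (O_DIRECT needs _GNU_SOURCE,
    // which g++ defines by default).
    #include <cstdlib>
    #include <cstring>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        const size_t kBlock = 1 << 20;   // 1 MiB write unit (example value)
        int fd = open("/data/logfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);  // placeholder path
        if (fd < 0) return 1;

        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, kBlock) != 0) return 1;  // 4 KiB alignment (example)
        std::memset(buf, 0, kBlock);

        // ... fill buf with serialized records ...
        if (write(fd, buf, kBlock) != static_cast<ssize_t>(kBlock)) return 1;

        free(buf);
        close(fd);
        return 0;
    }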


If you are interested in SSD optimizations you may want to read this research from Microsoft: The Bw-Tree: A B-tree for New Hardware[1]

1 - http://research.microsoft.com/apps/pubs/default.aspx?id=1787...


Over the last couple years, I've become a fan of in-process databases. Being able to use whatever data structures best fit your data, avoiding the network and serialisation (and subsequent GC) overhead tends to make for clean (and very fast) systems. Obviously it has drawbacks.


This is almost always not the best approach. Database software has been written in order to make developing software as painless as possible. Transactions (even at the single write level), foreign keys, indices, constraints are all there to make sure that your database should as much as possible not be able to end up in a broken state.

Writing a database yourself is immediately reinventing the wheel.

In terms of the advantages you mention:

- Datastructures - this is a niche advantage. There are probably some constraints, but all of the datastructures that I've seen map nicely to a serial form - one that doesn't is probably a bad smell (though that might just be because I haven't seen any decent examples).

- Avoiding the network - you can do this already with (say) sqlite, which is probably ideal for what you're describing.

- Serialization - bad luck - if you want to persist your data you must still serialize it.

- GC - with a modern runtime that can do generational garbage collection this is as good as free.

- Clean - usually at the expense of correctness. If you have enough data and time, you'll start seeing interesting bugs. I agree that writing raw SQL is frustrating, but there are usually libraries that do standard tasks whether that be configuration or ORM.

- Fast - you sound like you're at a data scale where the difference in speed doesn't matter - and even if you did have data on the order of hundreds of millions of rows, you'd likely only see a very small perf difference (assuming sane batching etc).

Writing your own database is probably fine if no existing database meets your needs (whether that be price, performance, storage capacity, etc) but this is rarely the case for all but the most demanding applications.


I think you make good points. I wanted to clarify how I used in-process DBs because it clearly wasn't obvious from my post.

I still rely on a "normal" database as the authoritative source of all data, and all writes go to this DB. But changes to the data are queued, workers pick up these changes, and update the in-process DB which handles 90% of read-requests (which, for most apps, is probably > 90% of their traffic).

In some ways, it's a cache. But, it doesn't do misses (it's assumed all the data is there), it's kept relatively consistent (you can normally propagate changes in < 2s), and you have access to much more than just a GET.
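
Roughly this shape, with all names below invented for illustration: the authoritative DB takes the writes, change events land on a queue, and a worker applies them to an in-process map that serves the read traffic.

    // Sketch: an in-process read store kept up to date by a change queue.
    // Writes go to the authoritative database elsewhere; this only mirrors them.
    #include <condition_variable>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <shared_mutex>
    #include <string>
    #include <unordered_map>

    struct Change { std::string key; std::string value; };

    class InProcessStore {
        std::unordered_map<std::string, std::string> data_;
        mutable std::shared_mutex data_mu_;
        std::queue<Change> queue_;
        std::mutex queue_mu_;
        std::condition_variable cv_;

    public:
        // Called by the worker that tails the authoritative DB's change feed.
        void enqueue(Change c) {
            { std::lock_guard<std::mutex> lk(queue_mu_); queue_.push(std::move(c)); }
            cv_.notify_one();
        }

        // Background loop: drain changes into the read map.
        void apply_loop() {
            for (;;) {
                std::unique_lock<std::mutex> lk(queue_mu_);
                cv_.wait(lk, [&] { return !queue_.empty(); });
                Change c = std::move(queue_.front());
                queue_.pop();
                lk.unlock();
                std::unique_lock<std::shared_mutex> wl(data_mu_);
                data_[c.key] = std::move(c.value);
            }
        }

        // Serves the bulk of read traffic directly from memory.
        std::optional<std::string> get(const std::string& key) const {
            std::shared_lock<std::shared_mutex> rl(data_mu_);
            auto it = data_.find(key);
            if (it == data_.end()) return std::nullopt;
            return it->second;
        }
    };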

Also, about performance... this lets us handle read requests with <1ms response times vs, say, 50ms (or 200ms if we're talking about most ORMs). That can have a huge impact on (a) usability (especially for things sensitive to low latency (autocomplete)) and (b) cost (the difference between $200/month and $10K/month). You also don't need to put ALL your data in this. Just the most frequently accessed.

Hope that clarifies a little.


>Transactions (even at the single write level), foreign keys, indices, constraints are all there to make sure that your database should as much as possible not be able to end up in a broken state.

But you can have all that in process. Consider the most obvious case of just inverting things: writing your app directly in postgresql.


I agree. Using rich data structures that you can manipulate instantly is awesome. And usually your language will have superior query and manipulation abilities, without the mental gymnastics of adapting your algorithm to the underlying data store's query interface.

I'm looking for a good Node library that handles serializing/restoring my program state. Have you seen any of note? Seems like an easy task, but I'm hesitant to deploy anything substantial without much experience with long term operation.


Reminds me of Prevayler [1], a database that had similar insane claims ("twenty thousand times faster than a regular database") and which carefully omitted to say that it was an in-memory database.

[1] http://prevayler.org/


Prevayler has always made the point that it is in-memory. Notably the front page says: "It is an implementation of the Prevalent System design pattern, in which business objects are kept live in memory and transactions are journaled for system recovery."

It's been around for quite a while. Nearly 15 years ago Ward Cunningham (yes, that Ward Cunningham) was interested in the system and wrote a benchmark for it - I was one of the people who contributed data[1].

I never thought it was a great idea then, but back in those days computers were usually RAM-starved. Now it might make a little more sense.

[1] http://c2.com/doc/SerializationThroughput/


I used Prevayler on a small project a long time ago; it was awesome. The webapp was just incredibly snappy. Thanks for the nostalgia. :-)


What's up with the random plug for "VoltDB" apropos of nothing?


I think he's just giving credit where credit is due. I read it as "I'm making a generalisation about in-memory databases, but VoltDB, which people would associate as an in-memory database, stands out as having other worthwhile innovations."


It's mentioned in an earlier message in the thread: http://www.postgresql.org/message-id/4FE899CE.60901@optionsh...


The topic of discussion is in-memory databases. The post made a general statement that in-memory databases are very similar to disk-backed databases. VoltDB is called out as an exception.


The guy who created Postgres also created VoltDB, so I'd imagine that most of the Postgres community is already familiar with it from watching his talks. (Which are worth watching, fwiw.)

https://www.youtube.com/results?search_query=michael+stonebr...


It's a bit unfair to pick on a 2-yr old post, but I don't agree with what Josh is saying here, beyond the fact that anyone claiming to be the world's fastest database is talking about a particular benchmark that makes them look good.

So there are quite a few things you can do to go faster once you've decided your data is going to fit in memory. An easy one is to build memory-centric indexes that never hit disk. These can be several times faster than the disk-page-sized b-trees that most disk-backed systems use, even if the disk-backed systems are sitting on a ram-drive.

A second thing you can sometimes get away with is more granular locks on data structures. Since you don't worry about holding a mutex during a page fault, you don't need super complex concurrent data structures. It might be faster just to shove a queue in front of an index and let each index lookup happen sequentially. This is one of the biggest VoltDB tricks (though I'm simplifying). It's also easier to build something like the MemSQL skip-list in memory, but I've never seen it outperform the simpler VoltDB system at either non-column-based scanning or at index lookups. Lock-free data structures may not block, but they're not simple and they make other tradeoffs that affect performance.
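
A minimal sketch of the "queue in front of an index" idea (names are illustrative, not VoltDB's code): one thread owns the index outright, so the index structure itself needs no locks or atomics at all.

    // Sketch: a single thread owns the index; other threads just enqueue
    // requests. The index itself is a plain, non-concurrent structure.
    #include <condition_variable>
    #include <cstdint>
    #include <functional>
    #include <map>
    #include <mutex>
    #include <queue>

    struct Request {
        int64_t key;
        std::function<void(const int64_t*)> done;  // called with result or nullptr
    };

    class IndexOwner {
        std::map<int64_t, int64_t> index_;   // never touched by more than one thread
        std::queue<Request> queue_;
        std::mutex mu_;
        std::condition_variable cv_;

    public:
        void submit(Request r) {
            { std::lock_guard<std::mutex> lk(mu_); queue_.push(std::move(r)); }
            cv_.notify_one();
        }

        // The only thread that ever reads or writes index_.
        void run() {
            for (;;) {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [&] { return !queue_.empty(); });
                Request r = std::move(queue_.front());
                queue_.pop();
                lk.unlock();
                auto it = index_.find(r.key);
                r.done(it == index_.end() ? nullptr : &it->second);
            }
        }
    };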

As far as scanning tuples goes, which the recent MemSQL marketing calls out, I don't think that has much to do with disk or memory, but rather the architecture and implementation. Vertica's big innovation was using a column-store to scan faster and this seems to be what MemSQL is doing here. If you want to do full table scans, this is very smart. Although you pay a loading penalty as a tradeoff for query performance. Vertica's huge win was technology to make that loading penalty smaller; I'm not sure what the costs to load MemSQL's column data are. I suspect one key innovation is being cheaper than Vertica.

Another legit innovation of VoltDB was to give up external transaction control along with disk. The only thing worse than keeping a transaction open while waiting for disk is keeping it open while waiting for a user. Getting rid of both allows for a crazy fast system when you can assume no substantial waits in your fast path. Using batched SQL or stored procs, you can keep the conditional logic you love, but go fast. Many NoSQL systems (and MemSQL) get this same speedup by just dropping transactions altogether. MVCC does a lot to alleviate this problem, but adds its own complexity; it's really hard to get right/fast over a network.

Hekaton is another system that has made really different choices when focused on memory and reaped benefits. I think for business reasons, it's really tied to MS SQL Server's disk implementation. Though that has pros to go with cons. Hekaton is also new since Josh wrote this post.

And all of these systems can be fully/mostly persistent on disk. It's more important that your data fits in memory for design choices than it never writes to disk. VoltDB in particular can perform many millions of synchronous writes per second. It can do this with SSDs or even with spindles using a flash-backed disk controller.

Disclaimer: VoltDB Eng.



