
A well-designed disk-oriented database engine operating in cache will often be as fast as an in-memory database engine.

I once designed an in-memory database kernel for a supercomputer. One of the most fun designs I ever did. However, I also learned a lot about the realities and nuances of in-memory designs that are magnified on a supercomputer but still have large impacts on modern server hardware.

The biggest insight was this: memory behaves like complex block storage, with everything that implies for optimization, if you really care about performance. You can easily lose 10x the theoretical throughput by treating in-memory structures as random access storage.
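
To make the "10x" concrete, here is a minimal sketch (my own, not from any particular system) that sums the same array twice, once in sequential order and once through a shuffled index array; on commodity hardware the shuffled pass is typically several times slower because it defeats the prefetcher and wastes most of every cache line it pulls in:

    // Sequential vs. random traversal of the same data.
    // Build with something like: g++ -O2 -std=c++17 scan.cpp
    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t n = 1u << 24;               // ~16M entries, well beyond the LLC
        std::vector<std::uint64_t> data(n, 1);

        std::vector<std::uint32_t> order(n);
        std::iota(order.begin(), order.end(), 0u);    // 0, 1, 2, ... (sequential order)

        auto run = [&](const char* label) {
            auto t0 = std::chrono::steady_clock::now();
            std::uint64_t sum = 0;
            for (std::uint32_t i : order) sum += data[i];
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - t0).count();
            std::printf("%s: sum=%llu in %lld ms\n", label,
                        (unsigned long long)sum, (long long)ms);
        };

        run("sequential");
        std::shuffle(order.begin(), order.end(), std::mt19937{42});
        run("random");
        return 0;
    }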

In disk-backed systems, we tend to ignore this aspect of memory because, relative to disks, memory really is uniformly random-access to a first approximation. But disk-based systems are inherently optimized around multi-tier block access patterns. In a database kernel without disks, treating memory access appropriately (i.e. not random access) becomes a major point of optimization with huge gains if you do it right. And even in systems with highly optimized disk I/O subsystems (i.e. no POSIX I/O calls or kernel caching), treating memory as random access starts to put a serious drag on overall system performance.

In-memory databases let you avoid dealing with disk I/O, which gives a major boost in performance. However, if performance matters, that does not let you treat memory as a random access storage device. Designing algorithms and data structures that optimize memory I/O patterns often yields as much performance uplift over naive random access assumptions as moving from disk I/O to in-memory does in the first place. Most in-memory systems never take that second step.
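
As a small illustration of what "optimizing memory I/O patterns" can mean in practice (a toy example of mine, with made-up names), compare a row-at-a-time layout against a column layout for a scan that only needs one field:

    #include <array>
    #include <cstdint>
    #include <vector>

    // Row-oriented ("array of structs"): summing `amount` drags the whole
    // 128-byte row through the cache, so only 8 of every 128 loaded bytes
    // are useful.
    struct RowAoS {
        std::uint64_t key;
        std::uint64_t amount;
        std::array<char, 112> payload;   // pads the row to two cache lines
    };

    std::uint64_t sum_amounts(const std::vector<RowAoS>& rows) {
        std::uint64_t s = 0;
        for (const auto& r : rows) s += r.amount;
        return s;
    }

    // Column-oriented ("struct of arrays"): the same scan reads a dense,
    // sequential array of amounts, so every loaded byte is useful and the
    // hardware prefetcher can stream it.
    struct TableSoA {
        std::vector<std::uint64_t> key;
        std::vector<std::uint64_t> amount;
        std::vector<std::array<char, 112>> payload;
    };

    std::uint64_t sum_amounts(const TableSoA& t) {
        std::uint64_t s = 0;
        for (auto a : t.amount) s += a;
        return s;
    }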




> The biggest insight was this: memory behaves like complex block storage, with everything that implies for optimization, if you really care about performance.

Could you describe the factors causing that? Is it just the caching hierarchy or are there other effects?


The biggest effects are the varying page sizes as bits move through silicon, and the latencies of memory access, which reflect cache hierarchies but also NUMA, DIMMs, and other considerations around how memory is connected to the system and the topology of how memory moves into the local CPU cache. Due to the physical size of supercomputers, the latency variance for "flat" cache coherent memory access can become so bad that it has pathological effects on algorithms that work adequately at lower latencies. With how fast and efficient CPUs are today, this has a huge effect even on a single system board. For example, in databases it is not uncommon to lock a core to physical memory that is directly attached to that CPU's memory channels rather than to some other CPU's memory channels.
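
On Linux, that core-to-local-memory pinning looks roughly like the following sketch (assuming libnuma and pthreads; error handling omitted, and the core/node choice is hard-coded purely for illustration):

    // Build with something like: g++ -O2 pin.cpp -lnuma -lpthread
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>
    #include <cstddef>
    #include <cstdio>

    int main() {
        if (numa_available() < 0) {
            std::fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        const int cpu  = 0;
        const int node = numa_node_of_cpu(cpu);   // socket/node that owns this core

        // Pin the calling thread to that core...
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        // ...and allocate its working set from the same node's DIMMs, so loads
        // stay on the local memory channels instead of crossing the interconnect.
        const std::size_t bytes = std::size_t(1) << 30;   // 1 GiB
        void* buf = numa_alloc_onnode(bytes, node);
        std::printf("core %d pinned, 1 GiB allocated on node %d (%p)\n", cpu, node, buf);

        numa_free(buf, bytes);
        return 0;
    }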

Some performance-sensitive open source libraries do varying amounts of memory topology aware optimization. The rudimentary tricks can often be worth a 2x performance boost for in-memory processing.

In the same way that good databases use really clever schedulers and disk I/O systems (they don't let the OS do it) to optimize average access time and its variance, high-performance in-memory systems use very similar techniques to optimize the average access time to any particular piece of memory. The techniques are very similar in the abstract, but there is a lot less written about the nuances of in-memory scheduling; it often involves understanding lower-level details of the silicon than programmers are used to studying. The net effect is a large reduction in busy stalls and contention.
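
One concrete flavor of this kind of in-memory "scheduling" (my example, not necessarily what the parent poster built) is batching probes and issuing software prefetches up front, so several cache misses are in flight at once instead of the core stalling on each one in turn. The sketch below assumes GCC/Clang for __builtin_prefetch:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Bucket { std::uint64_t key; std::uint64_t value; };

    // Count how many probe keys hit their home bucket in an open-addressed
    // table (collision handling omitted to keep the sketch short).
    std::uint64_t probe_batched(const std::vector<Bucket>& table,
                                const std::vector<std::uint64_t>& keys) {
        const std::size_t mask = table.size() - 1;   // table size assumed a power of two
        constexpr std::size_t kBatch = 16;
        std::uint64_t hits = 0;

        for (std::size_t base = 0; base < keys.size(); base += kBatch) {
            const std::size_t end = std::min(base + kBatch, keys.size());

            // Phase 1: kick off the loads for the whole batch.
            for (std::size_t i = base; i < end; ++i)
                __builtin_prefetch(&table[keys[i] & mask]);

            // Phase 2: by the time we loop back, the lines are (hopefully)
            // in cache, so the compares no longer stall on DRAM latency.
            for (std::size_t i = base; i < end; ++i) {
                const Bucket& b = table[keys[i] & mask];
                hits += (b.key == keys[i]);
            }
        }
        return hits;
    }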

Also, people tend to forget that even though lock-free algorithms are fast on average, they tend to have a lot of overhead in systems with many cores unless they are used sparingly and accessed relatively infrequently. Eliminating that waste can have large performance benefits.
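
A small sketch of that overhead and one common way around it (assuming C++17 for the over-aligned allocation): a single "lock-free" atomic counter still bounces one cache line between every core that touches it, whereas one padded counter per thread shares nothing on the write path and only pays a scatter/gather on the rare read:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One counter per worker thread, padded out to its own cache line so that
    // increments from different threads never contend on the same line.
    struct alignas(64) PaddedCounter {        // 64 bytes assumed as the line size
        std::atomic<std::uint64_t> n{0};
    };

    class ShardedCounter {
    public:
        explicit ShardedCounter(std::size_t shards) : shards_(shards) {}

        // Hot path, called constantly: touches only the caller's own line.
        void add(std::size_t thread_id, std::uint64_t v) {
            shards_[thread_id % shards_.size()].n.fetch_add(v, std::memory_order_relaxed);
        }

        // Cold path, called rarely: sums across all shards.
        std::uint64_t read() const {
            std::uint64_t total = 0;
            for (const auto& s : shards_) total += s.n.load(std::memory_order_relaxed);
            return total;
        }

    private:
        std::vector<PaddedCounter> shards_;
    };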

To your question: In order to minimize the need for thread coordination and optimize memory scheduling, the first thing you need to do is design algorithms and data structures that approximately match the topology of your storage, whether memory or disk or both. And that rarely looks like idealized random access memory. Then you optimize for locality of execution. I've designed database kernels that had thousands of parallel threads running. It turns out that the techniques used to eliminate almost all coordination in that case produce extremely efficient database kernels on a commodity server with vastly smaller resources.
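
The shape of that coordination-free design, in miniature (my illustration, with invented names): partition the keyspace so that exactly one worker thread owns each shard, route requests to the owner, and then the hot path needs no locks or atomics at all:

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Shard {
        std::unordered_map<std::uint64_t, std::string> kv;   // touched by one thread only
    };

    class PartitionedStore {
    public:
        explicit PartitionedStore(std::size_t workers) : shards_(workers) {}

        // Which worker owns this key; routing happens before any data access,
        // e.g. by handing the request to that worker's queue.
        std::size_t owner(std::uint64_t key) const {
            return std::hash<std::uint64_t>{}(key) % shards_.size();
        }

        // Called only from the owning worker's thread.
        void put(std::size_t worker, std::uint64_t key, std::string value) {
            shards_[worker].kv[key] = std::move(value);
        }

        const std::string* get(std::size_t worker, std::uint64_t key) const {
            auto it = shards_[worker].kv.find(key);
            return it == shards_[worker].kv.end() ? nullptr : &it->second;
        }

    private:
        std::vector<Shard> shards_;
    };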

Understanding the topology of real silicon systems is a large part of "optimization".


That makes a lot of sense. Regardless of where your data is, you want to best map your accesses to the underlying reality of whatever hardware is in use. And memory has significant locality on any even moderately big system.

I do think this glosses over the difference between the data structures someone focused on efficiently storing and reading data to and from disk might use and the data structures someone focused mainly on efficiently representing the data in memory (ignoring durability) might use. Queries are one part of this, but dealing with updates and indexing can also be quite important.

I don't know if that difference is fundamental to the design of a database or just more of a lingering consequence of many databases being designed around most data being on disk, but it is what I observe looking at a variety of current database solutions.
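
One rough way to see that split is in how the two camps size their tree nodes; the numbers below are just typical illustrative values, not from any particular engine:

    #include <cstddef>

    constexpr std::size_t kDiskPage  = 4096;  // disk-first B-tree: node = one I/O unit
    constexpr std::size_t kCacheLine = 64;    // memory-first tree: node = one DRAM load

    // Fan-out if a node holds 16-byte (key, child-pointer) entries:
    constexpr std::size_t kDiskFanout   = kDiskPage  / 16;  // ~256 entries per node
    constexpr std::size_t kMemoryFanout = kCacheLine / 16;  // ~4 entries per node, so
                                                            // tree depth, pointer chasing,
                                                            // and prefetching dominate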

Getting back to the main article here, I've been doing some basic testing against MemSQL today and, "world's fastest" aside, I like a lot of what I see, other than painfully long query parse/compile times. It does, however, appear that for my queries most of the performance benefit comes from the distributed query engine rather than from any fundamental data structure differences compared to something like postgres or mysql/innodb. But my queries are very anti-index.

SpaceCurve also sounds interesting, hopefully we can firm up our use cases enough and get far enough along in a technical evaluation that I can find time to play with it.


For complex optimization problems like this, why do I never hear about the significant research on automatic optimization being applied? I would expect each architecture to have particular optimal memory-usage strategies that could be discovered with the same machine learning tools used to solve other sorts of optimization problems. I take it from your posts that there isn't a tool you can load up, run for a few days, and then export the most successful strategies it found as parameters for the system being implemented? Is there a particular reason why? Is the complexity far greater than I'm presuming?


Thanks for that. I hadn't realised that most (large) commodity systems these days had a NUMA setup. I thought it was relegated to the more esoteric end of things (supercomputer setups, single-system-image clusters).

Thanks - I can see that blindly treating NUMA memory as "uniform random access" is going to hurt.


Wow, dude, your post made me feel VERY incompetent. This happens a lot on HN.



