Architecture of a Database System (acolyer.org)
88 points by adamnemecek on Jan 23, 2015 | 17 comments



Because it takes years to design and implement a solid database architecture, the design principles embodied in current systems always tend to be a bit different from what you would build if you were starting today (whenever "today" actually is). Database designs are always fighting the last war, based on assumptions about resource balance, hardware behaviors, and system architectures that may not be strictly true anymore.

Some related comments, with respect to the article:

- The modern database model is one process per physical core, regardless of how many sessions, queries, connections, etc. there are. Each core owns the part of the data it works with. This has several advantages on modern computing architectures, and it also makes for a pretty elegant implementation.

- Related to the first point, new database kernels are increasingly "shared nothing" even within a single server. That is, physical resources are cut up between cores/processes and rarely shared; it is like running a cluster within a single machine. Again, this has significant performance advantages on modern machines.
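The routing half of that idea can be sketched in a few lines (the shard count, names, and hash choice here are illustrative, not taken from any particular engine):

```python
import zlib

NUM_CORES = 4

# Each "core" owns one private shard; in a real shared-nothing engine
# only that core's worker would ever touch it.
shards = [dict() for _ in range(NUM_CORES)]

def owner(key: str) -> int:
    """Deterministically map a key to the single core that owns it."""
    return zlib.crc32(key.encode()) % NUM_CORES

def put(key: str, value) -> None:
    # In a real engine this would be a message on the owning core's
    # queue; here we just index into that core's private shard.
    shards[owner(key)][key] = value

def get(key: str):
    return shards[owner(key)].get(key)
```

Because every key has exactly one owner, no locks or cache-line sharing are needed between cores, which is where the performance advantage comes from.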

- The key advantage of SSDs, which was never true of spinning disk, is that effective disk I/O bandwidth can usually be guaranteed to exceed the server's full-duplex network bandwidth, no matter what the workload. This has an interesting technical implication: if the I/O scheduler is correctly designed and implemented, an in-memory database engine should never be faster than a disk-backed one. In-memory was an optimization that made sense for spinning disk; with SSDs, if you aren't compute-bound you can always saturate the network (and if you are compute-bound, in-memory does not help).
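A back-of-envelope version of that argument, with assumed ballpark figures for the NIC and SSDs (illustrative numbers, not measurements):

```python
GBPS = 1e9 / 8  # bytes per second in one gigabit per second

# Assumed hardware: a 10 GbE NIC (counted full-duplex) and four NVMe
# SSDs at ~3 GB/s sequential read each.
network_full_duplex_bps = 2 * 10 * GBPS   # 2.5e9 bytes/s
ssd_read_bps = 3.0e9                      # per drive
num_ssds = 4

aggregate_disk_bps = num_ssds * ssd_read_bps  # 1.2e10 bytes/s

# If aggregate disk bandwidth exceeds what the NIC can move, the
# network (or the CPU), not the storage, is the bottleneck -- so
# holding the data in RAM instead cannot make responses arrive faster.
disk_outruns_network = aggregate_disk_bps > network_full_duplex_bps
```

With spinning disks at ~150 MB/s each, the same arithmetic goes the other way, which is why in-memory engines made sense in that era.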


Some of the more exciting ideas I've seen recently:

http://arxiv.org/abs/1310.3314 - Understanding the core difficulty in answering relational queries and examples of problems for which current query optimisers always produce plans which are asymptotically suboptimal

http://arxiv.org/abs/1404.0703 - The first theoretical analysis which can relate the choice of indexes to worst-case bounds. Presents a single join algorithm which is asymptotically optimal on every problem without even using cardinality estimates.

http://arxiv.org/abs/1210.0481 - A join algorithm that meets some of the bounds of the above paper and is also fast in practice and can be incrementally maintained.

https://infosys.uni-saarland.de/projects/octopusdb.php - Treating index creation, query optimising, view materialisation, incremental maintenance etc as one large optimisation problem.

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf - Compiling query plans to LLVM, because the query plans / indexes are good enough to make execution CPU-bound instead of memory-bound

http://hyper-db.de/HyperTechReport.pdf - Running OLAP and OLTP workloads on the same database without interference

It seems possible that in the future, far from having a Cambrian explosion of specialised databases, we will be able to store everything in a single db and treat questions of data layout, partitioning, indexing etc as a direct optimisation problem.


Those look awesome, thanks for posting them. Quick question, how do you learn about all these papers? They all seem very recent.


I'm working on join algorithms at the moment, so I spent the last few months getting up to speed on the latest research.

The rest is just from general interest. I spend around ten hours a week reading papers or textbooks. Whenever I find something really mind-blowing I follow up citations, set up Google Scholar alerts for the authors, subscribe to their RSS feeds, etc.

A lot of my favourite papers are linked on the OP site - it looks like a good place to start.


@jamil, thanks for the pointers! I will store some of these away for future editions of The Morning Paper :)


Awesome :)

Great work on the blog by the way, it's going into my morning reading list.


Do you know if HyPer is open source? The name makes searching a bit of a challenge.


http://hyper-db.de/index.html

I believe it's closed source and they plan to commercialise it, although I can't figure out where I got that idea from.


> if the I/O scheduler is correctly designed and implemented, an in-memory database engine should never be faster than a disk-backed database.

If, and only if, you are returning all or most of the data fetched from disk, and not just a small fraction of it (which is the most common use case for DBs, especially when you use stored procedures).

In the worst case, you have to scan the entire contents of a several-terabyte table from disk just to get metadata about that table. The worst case can usually be refactored away, but there are always cases where that is simply not possible.

I've seen some pretty silly speedups with flash (such as attaching the flash memory to DIMM slots with custom kernel drivers), but data served out of memory was still faster (if only because it doesn't require a context switch to read).


To be clear, I am assuming the same hardware budget for both the on-disk and in-memory case. Ceteris paribus and all that. An in-memory model has no inherent advantages in that case because the same memory is potentially available to both for a given workload. Another way of stating it is that SSD won't slow you down.

Ironically, in practice relative performance between these two models is all over the map. Quality of implementation has much bigger impact than the abstract model. I've seen high-quality disk-backed database engines wipe the floor with in-memory implementations on the same hardware and vice versa.


Yeah, performance frequently has more to do with the workload than with the underlying hardware.

One more thing to consider in your test case - equivalent-cost spinning rust would get you many times more storage. Server-grade SSDs are still very expensive per GB.


You can add a third tier of spinning rust for very little extra money, but latency may be an issue.


> Database designs are always fighting the last war, based on assumptions about resource balance, hardware behaviors, and system architectures that may not be strictly true anymore.

I've been studying cache-oblivious data structures recently and have been wondering why they don't seem to be taken seriously in modern database design. COLAs and shuttle trees both seem to considerably outperform traditional B-trees on random inserts while suffering only a slight slowdown on searches and sorted inserts [1].

[1] http://supertech.csail.mit.edu/papers/sbtree.pdf
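For anyone who hasn't seen a COLA before, the binary-counter insert idea at its core can be sketched in a few lines (a toy version only - the real structure in [1] adds fractional cascading, deamortisation, and block-friendly merges):

```python
import bisect

class COLA:
    """Toy cache-oblivious lookahead array: level k is either empty
    or a sorted list of exactly 2**k items. Inserts cascade upward
    like incrementing a binary counter, so most writes touch only
    small, cache-resident levels."""

    def __init__(self):
        self.levels = []  # levels[k] is [] or a sorted list of 2**k items

    def insert(self, x):
        carry = [x]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry
                return
            # Level occupied: merge it into the carry and move up one
            # level, exactly like a carry in binary addition. (sorted()
            # stands in for a linear-time merge to keep the sketch short.)
            carry = sorted(carry + self.levels[k])
            self.levels[k] = []
            k += 1

    def __contains__(self, x):
        # Binary-search every non-empty level; the real structure uses
        # fractional cascading to avoid paying a full search per level.
        for level in self.levels:
            i = bisect.bisect_left(level, x)
            if i < len(level) and level[i] == x:
                return True
        return False
```

The amortised insert cost is O((log N)/B) block transfers versus O(log_B N) for a B-tree, which is where the random-insert advantage comes from.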


Saturating an InfiniBand (40 Gbps) link with disk I/O might be somewhat complicated.


A few PCIe SSDs should get you there (again if not CPU bound). You can put quite a lot in a single machine, they come in normal SSD form factor with backplanes. It would still be a bit higher latency than RAM, but much more capacity.
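The arithmetic, using an assumed ballpark read throughput per PCIe SSD (not a benchmark of any particular drive):

```python
import math

infiniband_bps = 40e9 / 8        # 40 Gbps link in bytes/second (5 GB/s)
pcie_ssd_read_bps = 3.0e9        # assumed sequential read per PCIe SSD

# Number of drives whose combined read bandwidth matches the link,
# assuming reads stripe evenly and nothing else is the bottleneck.
ssds_needed = math.ceil(infiniband_bps / pcie_ssd_read_bps)
```

So on those assumptions a couple of drives already match the link; the harder part in practice is keeping the I/O scheduler and CPU out of the way.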


I think he was assuming the commodity 1 Gbit / 10 Gbit links that most databases run on. Does anybody use InfiniBand to talk to their database? It would be interesting to hear from them.


Great post.




