B+ Trees and why I love them, part 1 (ayende.com)
97 points by rayvega on Aug 17, 2013 | 31 comments



I too "discovered" B-Trees (and B+Trees) recently and wrote about it: http://grisha.org/blog/2013/05/11/relational-database-on-top...

I even implemented a key/value backed SQL database (using Thredis, which is Redis and SQLite) http://grisha.org/blog/2013/05/29/sqlite-db-stored-in-a-redi...

SQLite3 has an awesome C implementation with copious comments if you really want to understand them, BTW.

The most fascinating thing about B-Trees is that the entire data structure is stored in equally sized blocks (pages), and getting a page by number is a very easy operation in block storage, which is what our RAM and hard drives are (even though back then they were tapes and punch cards).
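A minimal sketch of what "getting a page by number" means in practice (the file name and page size here are made up for illustration):

    PAGE_SIZE = 4096  # hypothetical fixed page size

    def read_page(f, page_number):
        # Equally sized pages mean the byte offset is a single multiplication,
        # followed by one seek and one read on the file / block device.
        f.seek(page_number * PAGE_SIZE)
        return f.read(PAGE_SIZE)

    with open("example.db", "rb") as f:   # hypothetical database file
        root = read_page(f, 0)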

Few people realize that it's not just DBs that use B-Trees; our filesystems are also largely B-Tree based.

It's remarkable how few of today's developers understand how they work or even know what they are, yet they are so fundamental to everything computers do.

Without B-Trees our present world of computing wouldn't exist as we know it. It could be the coolest data structure ever :)


I've read your first link and I don't get why you talk about "A relational database needs to store rows in order". AFAIK you have to add an "order by" clause to ensure ordering of any kind.

Relational algebra is probably not dependent on ordering for the common operations. Of course, the actual implementation of a relational database might need ordering of keys to be reasonably efficient. Is that what you meant?


The question is what "in order" means. The DB needs to store rows in some order that makes sense to it, but that might or might not match whatever order a user wants.

That's where ORDER BY comes in: presenting the data in some order that makes sense to the user. That might be completely different from what's stored in the DB. But the DB needs its own order, in order to be able to access things in a reasonably efficient manner.


To put it another way, the SQL standard doesn't guarantee any particular ordering. This does not mean that table storage must be unordered: it means that a database can store them in any order it wants, which might not be any order at all, or it might have some particular kind of order which the DB depends on to do its work.

But as a database user, you cannot count on that order matching what you, personally, would have implemented if you'd had a chance. ORDER BY clauses allow you to tell the database to sort the results of a query according to something appropriate for your particular needs, but this has no effect on how they are stored.


It needs to store rows in order to support indexes, so that a row can be found without having to scan the whole table. That's exactly what B-Trees let you do, and they also let you add or remove rows without rewriting the whole table, only the affected blocks.


What you say, isn't that more or less exactly what I wrote?


It should be noted that textbook B+ trees, like those described, are nowhere near state of the art performance-wise. It's possible to do orders of magnitude better.


The B-tree implementations used in a lot of databases have tweaks, but they are surprisingly similar to the textbook descriptions. In general, B-trees are actually a very useful data structure when:

a) Your data is much larger than main memory (e.g. terabytes/petabytes of data and gigabytes of RAM).

b) You want to support random insert/update/delete, seek, and next/previous key.

c) There has to be a reasonable constraint on storage space (i.e. data density/compaction).

In those cases the log(n) performance of a b-tree is extremely hard to improve on.

Alternate data structures are useful if:

1. The ratio of RAM/disk is more favourable, in which case using an improved in-memory data structure can be more efficient.

2. Not all operations have to be supported. In particular, eliminating the next/previous key operations allows something like a hashtable to be used.


There are two reasons a textbook b+ tree is never going to perform very well:

* Binary searches suck

The memory accesses of a binary search look pretty much like complete random access, so prefetching is useless. Once your b+ tree is big enough that most of it won't fit in L2, you've got a tight inner loop that's waiting on DRAM at every single iteration.
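A toy illustration of that access pattern (plain Python; the array and sizes are arbitrary):

    def binary_search(keys, target):
        # keys is one big sorted array (think: a large b+ tree node).
        lo, hi = 0, len(keys)
        while lo < hi:
            mid = (lo + hi) // 2
            # Successive probes jump half the remaining distance, so the first
            # several land far apart in memory: each one is likely a cache miss
            # (and a DRAM stall) until the interval shrinks to a few cache lines.
            if keys[mid] < target:
                lo = mid + 1
            else:
                hi = mid
        return lo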

This one is at least more or less solvable without changing the rest of the design, but then if you also want to solve the other main issue it gets harder.

* With small btree nodes, the overhead of looking up and locking btree nodes (even assuming they're all cached in memory) kills you.

Trouble is if you make your btree nodes big enough to fix that overhead, they're now so big that rewriting them when you insert a single key is just ridiculous.

So where you end up is that, to get anywhere near the highest possible performance, you're forced into log-structured btree nodes. That actually has its advantages for solving the first problem, but now what you've got is really a hybrid b+ tree/compacting data structure.

Source: bcache author.


"The memory accesses of a binary search look pretty much like complete random access"

Why would this be? I would think that even in a naive implementation the fan-out is high enough that top levels of the B+ tree will always be in cache, so that you are only hitting RAM for the last level or two and the leaf. For in memory use, I thought the usual argument against binary search was (as 'elbee' mentions) the poor branch predictability, not the caching.


I'm not talking about a b-tree that won't fit in L2, I'm talking about a B-tree that won't even fit in main memory. In those cases, even with SSDs, the cost of pulling a page off disk dominates the in-memory cost of the binary search so the number of I/Os needed to find something becomes the dominant factor in performance. That leads to a focus on data density and cache replacement algorithms.

"Binary searches suck"

That is a valid point. Cache misses when doing the search inside of the b-tree can be painful. I know that there is some research into structuring the b-tree nodes to be more cache-friendly (David Lomet mentions this in his "Evolution of Effective B-Tree" paper: http://research.microsoft.com/pubs/77583/p64-lomet.pdf) but haven't actually seen any in production.

Part of the problem here is that page density is so critical that people are often wary of having a less-efficient but more cache-friendly data representation. It is normally better to have 10% more records in main memory than to be able to search the nodes a touch faster. Various forms of key prefix compression are an example of this tradeoff.

Another consideration is the actual cost of doing the key comparison. I'm most familiar with relational database systems and they all support multi-column keys (e.g. string+integer+boolean+date), which normally lead to an expensive, branch-laden comparison function because the comparison has to be type-aware and support various collation options. The cost of doing that branching means the CPU cache misses are a smaller part of the overall search cost. One exception to that is ESENT, which uses memcmp-able normalized keys, but doing that creates its own set of problems (Unicode is a pain, lack of denormalization means storing data twice, etc.).

"Rewriting them when you insert a single key is just ridiculous."

I don't think anyone does this. Most implementations have settled on a system that stores the keys in the node in an ad-hoc fashion and have an array of 'pointers' to the keys which starts at the end of the page and grows towards the front (except for Postgres, which appears to do it the other way around). Inserting a new key means moving the pointers, but not the actual keys. Some systems try to avoid doing even that -- for example the InnoDB engine in MySQL links together the keys and the page directory only points to every 4th-8th key. I believe that this is an attempt to balance the speed of the binary search against the cost of update/delete.
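A rough sketch of that pointer-array ("slotted page") layout, under my own simplified assumptions (real engines pack this into a raw byte page and the details vary by engine):

    class SlottedNode:
        # Keys live wherever there is free space; only the slot array (the
        # 'pointers') is kept sorted, so inserting a key shifts a few small
        # offsets instead of moving the key bytes themselves.
        def __init__(self):
            self.heap = []    # key records, in arbitrary insertion order
            self.slots = []   # indexes into self.heap, kept sorted by key

        def insert(self, key):
            self.heap.append(key)
            pos = self._lower_bound(key)
            self.slots.insert(pos, len(self.heap) - 1)  # move pointers, not keys

        def _lower_bound(self, key):
            lo, hi = 0, len(self.slots)
            while lo < hi:
                mid = (lo + hi) // 2
                if self.heap[self.slots[mid]] < key:
                    lo = mid + 1
                else:
                    hi = mid
            return lo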

In addition, with write-ahead-logging the node isn't flushed when updated, instead the log records describing the update are flushed and the page is lazily flushed in the background by a checkpointing process, hopefully after several other updates have been made to the same node.
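A highly simplified sketch of that write-ahead-logging discipline (the structure and names here are mine, not any particular engine's):

    class BufferPool:
        # Only the log record is made durable when a node is updated; the page
        # itself stays dirty in memory and is written later by a checkpoint,
        # ideally after absorbing several more updates.
        def __init__(self, log_file, data_file, page_size):
            self.log = log_file
            self.data = data_file
            self.page_size = page_size
            self.dirty = {}   # page_no -> page bytes

        def update(self, page_no, page_bytes, log_record):
            self.log.write(log_record)
            self.log.flush()               # the log is what gets flushed now
            self.dirty[page_no] = page_bytes

        def checkpoint(self):
            # Background process: lazily write out accumulated dirty pages.
            for page_no, page_bytes in self.dirty.items():
                self.data.seek(page_no * self.page_size)
                self.data.write(page_bytes)
            self.data.flush()
            self.dirty.clear()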

Microsoft SQL Server, Microsoft ESENT, MySQL's InnoDB and Postgres' b-tree index all use minor variations on the basic textbook b-tree for their data storage.

Source: Many years working on the ESENT database engine, a brief stint in Microsoft SQL Server, some hacking done on the MySQL InnoDB engine.


I am looking into B+-trees for some sparse array implementations, but have little experience with b+-trees (or more advanced features). Do you have recommendations of code and papers to read besides the ones already cited?


This reading list will cover a lot of the current generation of b-tree techniques, but not the cutting-edge stuff (e.g. Tokutek):

* Ubiquitous B-Tree (Douglas Comer): http://doi.acm.org/10.1145/356770.356776

[A good review of the state-of-the-art in 1979]

* Prefix B-trees (Bayer, Rudolf and Unterauer, Karl): http://doi.acm.org/10.1145/320521.320530

[This paper introduces the idea of what I'm used to calling suffix compression -- the internal page pointers in a B-tree don't need to store the entire key, just enough of it to uniquely identify the node compared to its neighbor. This can be a very useful optimization for some use cases. The idea of reassembling a key as you seek down the tree is cool, but I don't think anyone does it in practice. There's a small sketch of the suffix-compression idea after this list.]

* B-tree indexes, interpolation search, and skew (Goetz Graefe): http://doi.acm.org/10.1145/1140402.1140409

[A lot of interesting ideas around doing search inside of a node, including cache awareness. Interpolated search looks very interesting, but I was never able to get it to go faster than binary search. That could have been my fault though.]

* A survey of B-tree locking techniques (Goetz Graefe): http://doi.acm.org/10.1145/1806907.1806908

[I really like Graefe's database survey papers. They are easy to read and practical/implementation focused. This one discusses a lot of important concepts: the difference between latches and locks, latch coupling, range locks and increment locks.]
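The suffix-compression idea from the Prefix B-trees paper above can be sketched in a few lines (the function name and example are mine):

    def shortest_separator(left_max, right_min):
        # The parent only needs to store the shortest prefix of right_min that
        # is still strictly greater than left_max; that keeps internal pages
        # dense, which keeps the tree shallow.
        for i in range(1, len(right_min) + 1):
            prefix = right_min[:i]
            if prefix > left_max:
                return prefix
        return right_min

    # e.g. shortest_separator("Jefferson", "Jones") returns "Jo"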


Actually, you can beat B-trees pretty handily across the board in exactly the scenario you described (a, b, c).

The log(N) performance of a B-tree is not just extremely hard to improve on (for searches), it's impossible to improve on. The lower bound for searching (in the DAM model) is log(N)/log(B), and B-trees meet that.

But B-trees are also log(N)/log(B) for insertions, which is, it turns out, pretty damn slow. There's an optimal trade-off curve between insertions and queries (the fastest data structure for insertions is to just log them, but that forces all queries to read all the data), and B-trees are on it, but there are many more interesting points on that curve.

There is a family of data structures that matches B-trees' performance on queries while blowing it completely out of the water for insertions. The COLA and Cache-Oblivious Streaming B-tree are where you'll find them in academia, and the implementation I work on has a very marketing-flavored name: the Fractal Tree. All that means is that we took the theory and started implementing it and came up with enough innovations in the implementation that it sort of needed a new name, but it's spiritually the same concept.

I've written a fair amount about this that I'll link to at the end, but here's a brief description so you don't think I'm making this all up.

Basically what we do is we take a B-tree and, on all the internal nodes, we stick large buffers that accumulate messages. Want to insert something? Just stick it in the root's buffer, you don't need to do any I/O to find the proper leaf node it needs to go in (yet). If the buffer's full, flush it down by taking all its messages, sorting them between the root's children, and putting messages in the buffers in the children. This flushing can cascade as you'd imagine, and splits and merges work about the same as in a B-tree.
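A stripped-down sketch of that insert/flush path (the names, fanout and buffer-size policy below are my own toy simplifications, not Tokutek's actual code; splits and merges are omitted):

    FANOUT = 4        # children per internal node (toy value)
    BUFFER_CAP = 8    # messages an internal node buffers before flushing (toy value)

    class Leaf:
        def __init__(self):
            self.items = {}
        def apply(self, key, value):
            self.items[key] = value

    class Internal:
        def __init__(self, pivots, children):
            self.pivots = pivots        # sorted keys separating the children
            self.children = children
            self.buffer = []            # pending (key, value) messages

        def insert(self, key, value):
            # An insert only touches this node's buffer: no I/O yet to find
            # the leaf the key will eventually live in.
            self.buffer.append((key, value))
            if len(self.buffer) >= BUFFER_CAP:
                self.flush()

        def child_for(self, key):
            i = 0
            while i < len(self.pivots) and key >= self.pivots[i]:
                i += 1
            return self.children[i]

        def flush(self):
            # Push the whole batch one level down; the cost of touching a child
            # is shared by every message in the batch, which is where the
            # O(1/B) amortized cost per message per level comes from.
            for key, value in self.buffer:
                child = self.child_for(key)
                if isinstance(child, Internal):
                    child.insert(key, value)   # may cascade further down
                else:
                    child.apply(key, value)
            self.buffer = []

    # Toy usage: the first BUFFER_CAP inserts never leave the root's buffer.
    root = Internal(pivots=[25, 50, 75], children=[Leaf() for _ in range(FANOUT)])
    for k in range(20):
        root.insert(k, "value-%d" % k)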

So now what's the analysis?

Well, the tree has the same shape as a B-tree, so searches have to look at the same log(N) number of nodes (which in practice is almost always just 1 for the leaf node after cache hits on the internal nodes, both for B-trees and for Fractal Trees, but the asymptotic analysis also tells you they have the same query cost).

For insertions though, let's walk through the process. The tree has height log(N), which means that, for an insertion to get "fully inserted", meaning it reaches the leaf node and won't be flushed any more, it has to get flushed down log(N) times. And what's the cost to flush a buffer full of messages? That's just O(1) I/Os, for the parent and children (and in practice it's actually only 2 I/Os because you can just flush to a single child, not all of them). But a buffer flush does work for O(B) messages at once, so the amortized cost to flush a single message down one level is just O(1/B). Do that log(N) times, and the insertion cost is O(log(N)/B), which is in practice around 100x less expensive than the B-tree's O(log(N)/log(B)) insertion cost.
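To put rough illustrative numbers on that (my own back-of-the-envelope figures, not a benchmark): with nodes holding B = 1024 keys, N = 10^9 rows, and logs taken base 2,

    B-tree insert:       log(N)/log(B) = 30/10   = ~3    node touches per insert
    Fractal Tree insert: log(N)/B      = 30/1024 = ~0.03 node touches per insert (amortized)

which is where the roughly 100x difference comes from.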

On top of this, while for B-trees you want small leaf nodes because you're going to be reading and writing them all the time, for Fractal Trees, since the goal is to get a lot done with each I/O, you actually want large leaves, on the order of a few megabytes each. This has two nice effects:

1) While range queries on a B-tree can slow down as the tree ages and the leaves start to get randomly placed on disk, range queries on a Fractal Tree stay fast because each time you do a disk seek, you get to read and report a few megabytes' worth of data. This basically solves the "B-tree fragmentation" problem that makes database users run optimize table, or vacuum, or reIndex() or compact() operations like madmen.

2) Compression algorithms (like zlib, our default) can compress large blocks of data much more effectively than they can compress small blocks. So InnoDB, which has small blocks like most B-trees, will only get at most about 4x compression if you turn it on, apart from eating CPU as it tries, fails, and re-tries to compress your data to fit its block size. In contrast, TokuDB (our MySQL storage engine using Fractal Trees [4]) routinely gets 10-20x compression without breaking a sweat.

I have some blog posts about this [1] and [2], and our benchmarks page is [3]. We also have a version of MongoDB in which we've replaced all the storage code with Fractal Trees, we call it TokuMX [5]. Don't mind the marketing haze, it's all serious tech under the hood.

[1]: http://www.tokutek.com/2011/09/write-optimization-myths-comp...

[2]: http://www.tokutek.com/2011/10/write-optimization-myths-comp...

[3]: http://www.tokutek.com/resources/benchmarks/

[4]: http://www.tokutek.com/products/tokudb-for-mysql/

[5]: http://www.tokutek.com/products/tokumx-for-mongodb/


That looks extremely interesting! The idea of amortizing the cost of inserts is fascinating. Looking at the design you sketched a few questions come to mind:

1) Multi-threading: suppose I seek down the B-tree for key K. Most B-tree implementations use the latch on the node containing K as the final arbiter of concurrency. For example, if I'm looking at K and then I want the next row (perhaps because I'm using the new Index Condition Pushdown optimization in MySQL 5.6, or I'm doing an online index build and need to scan all the rows) then I can simply look at the next row on the page I currently have (read) latched. With a fractal tree it looks like I have to worry about someone inserting a row immediately after the current row because that insert could have been cached at a higher level. Does this mean I need to keep some sort of latch/lock on the entire b-tree path down to the page I'm reading, instead of using latch coupling to work my way down? Alternatively do I have to work my way down from the top of the tree every time I want the next key?

2) How can you check for uniqueness? Suppose I create a table like this:

CREATE TABLE t1 (id NUMBER PRIMARY KEY, val1 VARCHAR2(30));

Amortizing the inserts seems to imply that primary key uniqueness violations can't be discovered until the inserts are pushed all the way down to the leaf?! In general uniqueness is an important part of data normalization, a good input for query optimization and normally required for foreign key constraints...


It is highly concurrent but you need some tricks beyond that simple description. Here's how we did it: http://www.tokutek.com/2013/01/concurrency-improvements-in-t... http://www.tokutek.com/2013/02/concurrency-improvements-in-t...

Yes, unique checks are bad. They make it perform as badly as a B-tree for unique inserts. There are sometimes ways around that but at some level if you aren't reading in the leaf node, you're going to be at a loss for some information. B-trees seem to have spoiled users into thinking uniqueness checks don't make inserts any more expensive, when in fact that's just because B-tree inserts are already that slow.


So out of the main b-tree operations (Insert/Replace/Delete/Seek/Next/Prev) you make a convincing argument that Tokutek can be faster than b-trees for inserts, if you use non-unique indexes (which automatically disqualifies the primary index in most cases, and makes foreign keys hard).

That is good but not quite "beat B-trees pretty handily across the board" :-)


Many primary keys in the wild are sequential, which makes the unique check hit cache and not require any I/O.

Let me be clear: in the worst case with a unique index, Fractal Tree indexes are the same speed as B-tree indexes. In most cases, they're far better. When you add in compression and agility, it really is "beat B-trees across the board."


To be honest, I have no idea what "agility" is. Perhaps I am too cautious but I'll want to see a lot more data than some hand-picked benchmarks and a mystical "unique constraints aren't interesting" before I'm completely convinced.

[Random aside: I greatly dislike block (i.e. InnoDB-style) compression, which is what Tokutek seems to use. Decompressing the entire block just to retrieve one column of one row is crazily expensive, especially as the decompression cost goes up as the block size goes up. There is also the problem of deciding whether to store compressed blocks in the buffer cache (slow), decompressed blocks in the buffer cache (wastes ram) or to have a cache of uncompressed blocks (the InnoDB approach, which kind of combines both problems). I think that format-aware page or column compression (like Oracle, SQL Server or ESENT) is far more effective.]

I guess that this has gotten way off topic but I am glad there are people out there doing exciting new things in the B-tree space. I will keep a closer eye on Tokutek in the future.


Agility refers to our online schema change abilities in MySQL (hot column add/delete, hot indexing).

Our compression technique is naive, you're right (but we don't decompress the whole 4MB for one single point query, as you may be thinking, we're a bit more sophisticated). We have some more ideas in the pipeline if we need them (basically a bunch of tricks you can play when you know the schema), but the fact remains that what we have right now is extremely effective and there isn't much incentive yet to finish off our smarter prototypes.


(Don't worry about being off topic, I love deep conversations like this. Come to our mailing list if you want to continue further though, we may be exhausting HN.)


Actually, this is the type of back-and-forth I (and I think others) would love to see more of. A dozen replies in, lots of good ideas, and no one is calling anyone else an idiot or a troll. Feel free to continue!


I now realize I didn't answer how we check uniqueness. We just do a query like any other normal point query, using a serializable transaction, and if we pass the check, then we do the insert with the same transaction (which took a row lock when we did the query because it was serializable, so nothing could have violated the uniqueness between the query and the insert).
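In pseudocode, my reading of that description looks something like this (the API names are invented, not TokuDB's):

    class UniqueViolation(Exception):
        pass

    def insert_unique(txn, index, key, value):
        # txn is a serializable transaction: the point query takes a row lock
        # on `key`, so no concurrent insert can slip in between the check and
        # our own insert.
        if index.get(txn, key) is not None:
            raise UniqueViolation(key)
        index.put(txn, key, value)   # same transaction, lock already held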


Also, MVCC snapshots mean that if you're doing a range query, it doesn't matter if someone comes in and injects a message above you while you're scanning a node.


So why isn't everyone using this instead of InnoDB these days? Is it not production-ready yet?


It is production ready, and many people are using it, but as with any deep stack technology, growth is a slow process. From an engineering standpoint, I'd say the data structure wasn't fully mature (ready to replace nearly all uses of InnoDB) until probably the end of 2012. Most of this was issues with concurrency that we methodically picked apart from the 5.0 release through 6.6.

A few users still point out annoyances here and there (like the location of files on disk, how certain metadata is presented at the MySQL level, dealing with corner cases in the query optimizer) that don't have to do with the data structure, but with the integration as a MySQL concept, and though these complaints are rare, they will eventually need to be addressed, and it's hard to find the manpower to address them immediately.

Part of this problem is that it just takes a long time to sand down all the rough little edges of a product, even though the core data structure is mature. Another part is educating people, changing expectations (for example, teaching people that unique indexes do have a higher cost than non-unique indexes), and generating better documentation. It's a slow process but we're confident that it's progressing and will continue. Recall that InnoDB, while broadly accepted as superior to MyISAM for many years before, only became the actual default engine in MySQL 5.5.


> We also have a version of MongoDB in which we've replaced all the storage code with Fractal Trees, we call it TokuMX [5].

What's the deal with TokuKV? Is it a working drop-in replacement for BerkeleyDB and similar key-value stores?


It's not really a drop in replacement for BDB, it's more like a library whose API was inspired by BDB. You're welcome to use it directly, but it doesn't implement all of BDB, we have added a bunch of things (like db->update), and there may be some weird contractual things you need to get right that we haven't documented well. Contact us if you'd like to use it, we can help you.


Thanks! I'm certainly not hung up on the BDB API, so I will check it out.


Please, tell me more.


This is very awesome. Please consider writing about other tree data structures in the future :)

Especially refinements of AVL and Red-Black trees.



