Copy-on-write B-tree finally beaten. (durusau.net)
188 points by jcsalterego on April 11, 2011 | 48 comments



I have not read the paper yet, but COW B-trees are interesting for practical reasons: the tree can be updated by writing in append-only mode, which is nearly the only way to avoid fsync (or the like) while still avoiding trivial corruption (the new root node is always written at the end of the update). Otherwise you either use fsync as a write barrier or live with the fact that write reordering, anywhere from the OS down to the disk itself, can very easily corrupt your data structure in the event of a crash.
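To make that concrete, here is a minimal Python sketch of the append-only update order described above; the node encoding and file layout are made up for illustration, not taken from any real implementation:

    import json

    def append_node(f, node):
        # Every node, new or copied, goes at the end of the file.
        off = f.tell()
        f.write((json.dumps(node) + "\n").encode())
        return off

    def cow_update(log_path, key, value):
        # A real tree would first walk root -> leaf and copy the whole
        # path; here a single leaf plus a new root stands in for that path.
        with open(log_path, "ab") as f:
            leaf_off = append_node(f, {"type": "leaf", "kv": {key: value}})
            append_node(f, {"type": "root", "children": [leaf_off]})
        # No fsync barrier per update: the new root is always the last
        # record appended, so recovery can scan for the last complete root
        # and fall back to the previous one if the tail is torn.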

However, I think the real problem here is not at the algorithmic level but at the OS API level. Just as a provocation: don't you think it is strange that, in the era of SSD drives where seeks are cheap, we resort to append-only data structures?


Actually, using append-only writes is very helpful for SSDs. Overwriting a particular piece of data in place is very tricky for flash drives.

In fact, under the hood, flash drives use append-only writes even when you are trying to overwrite something. For example, if you want to overwrite a byte at location X, the flash drive actually writes the new data at a different location Y and uses a logical mapping so that the byte at location Y pretends to be the byte at location X. The original byte at location X will eventually be "recycled" and erased.

Since flash does append-only writing under the hood anyway, it is better to do append-only writing explicitly so that you get its advantages (e.g., the ability to revert to older versions).
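A toy sketch of that remapping (a hypothetical flash translation layer, nothing vendor-specific):

    # "Overwriting" logical page X really appends to a fresh physical page
    # and updates a logical->physical map; the old page is garbage-collected
    # and erased later.
    class ToyFTL:
        def __init__(self):
            self.physical = []      # append-only physical pages
            self.l2p = {}           # logical page -> physical index

        def write(self, logical_page, data):
            self.physical.append(data)              # always append
            old = self.l2p.get(logical_page)
            self.l2p[logical_page] = len(self.physical) - 1
            return old                              # old page awaits erase/recycle

        def read(self, logical_page):
            return self.physical[self.l2p[logical_page]]

    ftl = ToyFTL()
    ftl.write(7, b"v1")
    ftl.write(7, b"v2")             # logical overwrite, physical append
    assert ftl.read(7) == b"v2"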


You raise a good point w.r.t. SSD drives, but the reality is that even if most computers being manufactured today included a solid state drive, the vast majority of consumers are using drives with moving parts. It will be at least a few more generations of hardware before OS-level developers will be able to focus specifically on the features of solid state drives when designing the filesystem and main OS APIs.

That said, there is room for someone to develop a filesystem API designed with solid state in mind, or even an entire OS, but I don't know if there would be enough of a market to make that development worthwhile.

Is there anything about SSDs that interests you, from the perspective of the future of Redis?


> That said, there is room for someone to develop a filesystem API designed with solid state in mind...

What about the Journalling Flash Filesystem? It's been around for quite some time.

http://en.wikipedia.org/wiki/JFFS2


I don't know much about the technical details of JFFS2, only that it takes forever to mount after an unclean shutdown and runs painfully, painfully slow on larger flash volumes (i.e. >512MB) as they approach capacity. I wrote a tutorial (mostly so I could remember how to do it myself) for converting a SheevaPlug from JFFS2 to UBIFS because of the performance difference: http://nitrogen.posterous.com/converting-a-sheevaplug-from-j...


Someone more knowledgeable might want to correct me here, but I think JFFS2 arrived a little too early; the performance gains are not enormous and it suffers from too many disadvantages (namely performance with small blocks and, apparently, determining free space??).

LogFS is yet another file system designed for larger solid state drives (on the order of gigabytes instead of megabytes), and seems to be a step in the right direction. http://en.wikipedia.org/wiki/LogFS


> LogFS stores the inode tree on the drive; JFFS2 does not, which requires it to scan the entire drive at mount and cache the entire tree in RAM. For larger drives, the scan can take tens of seconds and the tree can take a significant amount of main memory.

Interesting. I've only used JFFS2 on embedded systems like OpenWRT routers, where you wouldn't see the large drive penalty.


JFFS2 is intended to be used on raw, controller-less flash and thus has to solve many problems that are more efficiently done in hardware, so it's probably only useful in embedded systems. Running a normal filesystem on raw flash is a good way to make it unusable within a few days of light use.

In fact, JFFS2 works by scanning all the storage on mount and keeping it all in RAM, which causes its problems with large volumes (slow mounts and also high memory use). In the embedded systems it is intended for, mounting a volume is a rare operation and volumes are generally small; also, a typical embedded system running Linux has more RAM than flash.


JFFS(2) is aimed at embedded devices that access flash chips directly. With a controller in between, as with an SSD, which performs wear leveling and other tricks to emulate a real hard drive, it might even hurt performance.

I'm not sure about LogFS.


NILFS2 is probably a much better candidate: http://www.nilfs.org/en


I think you underestimate how seriously OS vendors are taking the quickly growing popularity of SSDs. A lot of work has already been done at the OS/filesystem level just for SSDs, for example to support TRIM.

Sure, it will be a long time, if ever (mechanical drives will likely remain important for bulk storage), before filesystem developers focus solely on SSDs, but SSD performance is already very important.


I have also not yet read the paper, but to your point about SSDs -- SSDs are definitely good for random reads, but for random writes it seems like having an append-only structure would help avoid write amplification. If you do a lot of small random writes to an SSD, that will be slow and drastically reduce the drive's lifetime.

(That said, I'm no SSD expert, so I might be way off)


The paper actually explains why their new algorithm also works better on SSDs.


I don't think append-only has much to do with seek time [1]; as you said, it has to do with corruption & consistency.

Second, they actually tout that their data structure is better on SSDs than btrees (due to block-size obliviousness):

> One major advantage is that they are naturally good candidates for SSDs, where a large asymmetry in read/write block sizes makes life very hard for B-trees. In contrast, the arrays making up the SDA can be written with extremely large effective block sizes.

Footnote in edit: [1] in the case of search trees; for large objects that will be read front-to-back it does become a seek-time issue.


SSDs are typically great at random reads, but not so good at random writes, particularly when you start rewriting the device (see http://www.acunu.com/2011/04/log-file-systems-and-ssds-made-...). The attraction of append-only structures is in alleviating the random writes; but you still need to be careful. For example, CoW trees and log file systems often have a big space blowup, which translates into a slowdown in insert rate, and potentially random IO in the log cleaner/GC. That's a scenario in which the Stratified B-tree does a lot better.
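As a rough illustration of that space blowup and cleaner cost, here is a toy log-structured store (entirely made up, not Acunu's design): overwriting the same keys piles up dead records, and reclaiming them means copying the live ones forward, which is extra write traffic.

    class ToyLog:
        SEGMENT_SIZE = 4

        def __init__(self):
            self.segments = [[]]            # append-only segments
            self.index = {}                 # key -> (segment, slot) of live record

        def put(self, key, value):
            if len(self.segments[-1]) >= self.SEGMENT_SIZE:
                self.segments.append([])    # roll to a new segment
            seg = self.segments[-1]
            seg.append((key, value))
            self.index[key] = (len(self.segments) - 1, len(seg) - 1)

        def space_blowup(self):
            total = sum(len(s) for s in self.segments)
            return total / max(1, len(self.index))   # records kept per live key

        def clean(self):
            # Copy only live records into fresh segments; this is the extra
            # (and potentially random) I/O the cleaner/GC has to do.
            live = [(k, self.segments[s][i][1]) for k, (s, i) in self.index.items()]
            self.segments, self.index = [[]], {}
            for k, v in live:
                self.put(k, v)

    log = ToyLog()
    for i in range(20):
        log.put("k", i)                 # 20 records written, only 1 live key
    # log.space_blowup() is now 20.0; clean() rewrites just the live record.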



Right, but the blog post does give a second reference.

I prefer giving the Arxiv abstract list, which always links to the current version of the PDF.

http://arxiv.org/abs/1103.4282


This sounds, in many ways, very similar to the data structure underlying Google's Bigtable (http://labs.google.com/papers/bigtable.html) and its descendants. Multi-version key/value store, data organized into a hierarchy of generations, each generation stored in large linear chunks, Bloom filters to minimize the number of generations consulted for a single-key fetch... it would be interesting to see a direct comparison, too bad the authors didn't mention Bigtable in their analysis of previous work.

I also wish the paper spelled out the data structures and algorithms in more detail. I did a little searching but couldn't find more information anywhere. Has anyone found a more complete writeup?


It's somewhat similar in that BigTable and Cassandra also perform writes as sequential I/O, by writing to SSTables and then merging them together using more sequential I/O; the stratified B-tree does the same with its arrays. This is the not-so-secret sauce that keeps inserts fast compared to vanilla B-trees.

However, stratified B-trees bring some algorithmic rigor to the merging of the arrays, which provides guarantees on search time.

The authors do indeed provide some comparison with Cassandra which should answer your question - http://www.acunu.com/2011/03/cassandra-under-heavy-write-loa... http://www.acunu.com/2011/03/cassandra-under-heavy-write-loa...

Summary: 1) Single-key read performance in Cassandra depends somewhat on when the last compaction was done (Bloom filters help here, but even a couple of false positives per query eats random I/O; the fewer SSTables there are, the less random I/O), so write spikes create poor read performance, and 2) Bloom filters don't work for range queries, so those tank in Cassandra in general. Stratified B-trees don't need Bloom filters and perform better, and more consistently, in both cases.
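For a concrete picture of that merge step, here's a toy sketch (names and formats invented; real SSTable/array merges also handle tombstones, timestamps, and so on). Sorted runs are combined with purely sequential reads and one sequential write, and the newest run wins on duplicate keys:

    import heapq

    def merge_runs(runs):
        """runs: list of sorted (key, value) lists, newest run first."""
        streams = [[(k, age, v) for k, v in run] for age, run in enumerate(runs)]
        out, last = [], None
        for k, age, v in heapq.merge(*streams):
            if k != last:               # lower 'age' (newer run) sorts first
                out.append((k, v))
                last = k
        return out

    old = [("a", 1), ("c", 3)]
    new = [("a", 10), ("b", 2)]
    assert merge_runs([new, old]) == [("a", 10), ("b", 2), ("c", 3)]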


Thanks for the links. The statistics on latency distribution are quite impressive, to say the least.

Why do you say that stratified b-trees don't need Bloom filters? Yes, the improved merging discipline reduces the number of arrays to read, but presumably there is often >1 array, which is sufficient to make Bloom filters desirable. Even if you only have two arrays, doubling the number of random I/Os per key lookup is easily enough of a penalty to make Bloom filters worthwhile. The paper itself seems to indicate that Bloom filters are used:

"In the simplest construction, we embed into each array a B-tree and maintain a Bloom filter on the set of keys in each array. A lookup for version v involves querying the Bloom filter for each array tagged with version v, then performing a B-tree walk on those arrays matching the filter."


The paper doesn't say, so I am making some assumptions here, but if you had a Bloom filter per array, then as these doubling arrays get really big, all the Bloom filter would tell you is that the target entry is probably somewhere in a giant array. A false positive in the Bloom filter would still cause a significant amount of work.

The stratified B-trees use forward pointers to help target the search in the next array down the tree. As with regular B-trees, the smaller top-level arrays will likely be cached in memory, so the number of random I/Os will be small.
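A hedged sketch of the forward-pointer idea (this is my reading, not necessarily the paper's exact scheme): each entry in a small array records where its key would land in the next, larger array, so a lookup only binary-searches a narrow slice of each big array.

    import bisect

    def build_forward_pointers(small_keys, big_keys):
        # For each key in the small array, remember its insertion point
        # in the next, larger array.
        return [bisect.bisect_left(big_keys, k) for k in small_keys]

    def narrowed_search(key, small_keys, big_keys, fwd):
        i = bisect.bisect_left(small_keys, key)
        lo = fwd[i - 1] if i > 0 else 0
        hi = fwd[i] if i < len(fwd) else len(big_keys)
        j = bisect.bisect_left(big_keys, key, lo, hi)
        return j < len(big_keys) and big_keys[j] == key

    small = [10, 20, 30]
    big   = list(range(0, 40, 2))                 # 0, 2, ..., 38
    fwd   = build_forward_pointers(small, big)
    assert narrowed_search(26, small, big, fwd)   # only probes a slice of 'big'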


Acunu (http://acunu.com) is a UK startup with an implementation of Cassandra; it supports versioning, presumably based on this.


Correct, here's a post by one of the paper authors: http://www.acunu.com/2011/03/big-dictionaries-ii-versioning/


This is actually very exciting. I decided to explore the site a little and found this quote:

"most ambitious storage project undertaken since ZFS" (http://www.acunu.com/2011/03/why-is-acunu-in-kernel/)

It looks like they put a key/value store in the kernel and came up with a new userspace API for it. I can also see how getting something like this into the mainline kernel is going to be a big uphill battle, but it might actually be a really big win.


Part of my job is working on an in-kernel key/value store for a data center network operating system. The problem is that the context switching between user and kernel space kills your performance if you're targeting sub-millisecond reads/writes. That may not matter for something like Acunu when you factor in network latency, but when you're using the database as part of the packet path it does. In addition, under high system load your user-space process has a high likelihood of getting scheduled out at the ioctl call, which makes latency even worse. On the other hand, being in the kernel allows you a little leeway in terms of durability constraints and all that, because if you screw up, the entire system comes crashing down anyway.

It will never be in the mainline kernel. Also, although I haven't actually looked at what they did yet, I assume they're just loading a regular old kernel module instead of really messing with a lot of the mainline code.


Are Fractal Trees better than B-trees? Apparently so; they are used by a database add-on for MySQL: http://tokutek.com/2010/11/avoiding-fragmentation-with-fract...


These stratified B-trees are basically multi-versioned variations of Fractal Trees (a.k.a. cache-oblivious streaming B-trees). This paper even references the Tokutek team's paper: http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf.

I wouldn't be surprised if the Tokutek guys implemented something very similar to this Stratified B-tree to implement MVCC.


Edited: correct link to the actual paper: http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.4282v2.pdf


I strongly prefer links to the abstract: http://arxiv.org/abs/1103.4282v2

This lets you: (a) read the abstract before deciding whether or not to read the paper, (b) easily find earlier (or later) versions of the paper, and (c) easily search for other papers the authors have on the arXiv.


If we're replacing blog links with arXiv abstract links, note that there are actually two papers linked, both interesting.

"Stratified B-trees and versioning dictionaries" http://arxiv.org/abs/1103.4282 This is actually the newer paper, and presents the "Stratified B-Tree" data structure itself. It also explicitly mentions SSDs and appending, by the way.

"Optimal query/update tradeoffs in versioned dictionaries" http://arxiv.org/abs/1103.2566 This is a longer and more, shall I say, academic paper. It appears to be incompletely finished, though, because it has a TODO, and also a marginal FIXME.


Thanks for that. The second paper you reference contains more details of the data structure. I have submitted an updated version of the second paper - it will probably appear on arxiv in 2 days. We have a much improved version of it, though, which I hope to post sometime next week.

For more information about append-only B-trees and SSDs, see http://www.acunu.com/2011/04/log-file-systems-and-ssds-made-...


I don't suppose there's a chance of seeing the O'Caml implementation released as well, is there? It's neat to see it being used for what (IMO) is a really underutilized sweet spot for the language.

(That & the fact that a code release is an invaluable road map for anyone looking to follow in your footsteps.)


The OCaml implementation is what we use internally to test implementations of complicated new data structures, and as such, it's quite unreadable to anyone else! We will probably talk at CUFP about our experiences with OCaml. It definitely has some upsides, but is not without some fairly major downsides (concurrency, serialization, lack of consistently-named library functions, ...). As such, we're rewriting our basic data structures in Scala - this seems to keep most of the benefits without some of the major problems. Now that we know how to implement the structures, we might be able to do a clean implementation which can be released!


Thanks. The paper mentioned an open-source project for this. Does that exist yet?


The core kernel code will be released in the next few weeks, along with a community site, probably code.acunu.com. We hope to post an announcement when that happens.


Hmm, that looks like a different paper. Try this: http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.4282v2.pdf


Since they want to start with an implementation built into kernel libs, I wonder how easy it would be for btrfs to switch... Since the interface is similar, I really hope it's possible.


btrfs can't switch; the structure is tied to the on-disk format.



Before even reading the paper, I lazily wonder if any of the techniques used have implications for Clojure's persistent vector implementation.


The challenges faced with on-disk data structures compared to in-memory ones are quite different in some respects, so not necessarily. (e.g. disk seek time isn't uniform, indirection hits you MUCH harder, sectors are much bigger than cache lines, robustness against sudden loss of power is important, etc.).

That said, I can't tell what they're suggesting from a cursory reading, I'll need to spend some time deciphering it. (this one seems to fall into one of those unparseable academic publications where the interesting bits are buried somewhere deep within the text)


On the other hand, cache-oblivious algorithms (and upon first skim, the presented data structure is one) should play just as nicely at the L1/L2 vs. RAM level as they do at the RAM vs. disk one, without needing to be specifically tuned or rewritten.

That's the theory, of course. ;)


As danvet has hinted at, on-disk data structures don't exist in isolation. You've generally got a CPU and memory system attached to the disk that is orders of magnitude faster[1], so if you can reduce the number of dependent disk I/O operations by increasing the required processing power, that's usually a trade-off you want to take. As an example, linear searches through each 4kiB disk block you read are absolutely acceptable, whereas doing that for an associative data structure in memory is madness.
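For instance, a toy sketch of that last point, with a made-up fixed-size record layout: one seek and read pays for the whole 4 KiB block, and scanning it linearly is pure CPU work.

    import struct

    REC = struct.Struct("<QQ")               # 8-byte key + 8-byte value

    def find_in_block(block, key):
        # One disk seek+read buys the whole block; scanning it costs only CPU.
        for i in range(len(block) // REC.size):
            k, v = REC.unpack_from(block, i * REC.size)
            if k == key:
                return v
        return None

    def block_lookup(f, block_no, key):
        f.seek(block_no * 4096)
        return find_in_block(f.read(4096), key)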

In any case, having inspected the 2 papers a little more in depth, this data structure is very different from Clojure's in that it holds multiple versions directly.[2] Additionally, it does indeed take all the trade-offs to minimise disk I/O in favour of more computation. I find it unlikely that much insight can be gained from this for in-memory data structures.

[1] You can expect ~10ms latency for spinning disks, ~10-100µs for solid-state drives, ~10ns for memory and ~1ns for the CPU. Spinning disk to memory is 6 orders of magnitude; 3-4 OOM for solid-state drives.

[2] In Clojure the MVCC STM is implemented at the ref level, not the collection level. Doing it at the collection level may be possible, but probably isn't desirable, as complexity will go through the roof.


In general, I agree - as you'd expect, practical implementations of this structure will naturally deviate quite far from the paper's presentation, and indeed we do make many optimizations for SSDs and rotating disks. But, in theory, disk vs RAM is no different to RAM vs L2. A central trick is the 'stratification' procedure that operates on arrays - that could be helpful elsewhere.


I agree with everything you said (hence the winky hanging off of "theory"), except for this:

> I find it unlikely that much insight can be gained from this for in-memory data structures.

Reasons for my skepticism:

First, better big O results usually come with some valuable insights that can be reapplied.

Second, the tradeoffs in constant factors are really hard to predict. E.g. the extra CPU cycles may pay for themselves by increased cache coherency or better behaviour under multi-core contention. Hard things to even measure, let alone predict. ;)

Third, besides caring about absolute speed, degradation is important in some cases -- an in-memory-only implementation that the OS decides to swap out to disk won't hit (as much of) a brick wall if/when that happens.

Fourth -- although the extra complexity is prohibitive, MVCC tuned for a specific data structure will almost certainly do way better on all kinds of metrics than a language/runtime-wide implementation. Appropriate for Clojure? Not my call. Useful in general? Probably so.


The problem is that the amount of raw CPU power you can burn to hide a storage miss is _much_ larger than for a cache miss. So the tuning is invariably completely different, sometimes leading to completely different algorithms.


The theory guarantees optimality to within a constant factor. I've seen some pretty big constants, though, so in practice of course you have to actually measure how it performs on a representative workload.


So, anybody know if they're pursuing a patent on this tech? A quick Google Patent search didn't turn up any US results. I ask because this language from the paper sounded patent-y:

"In a more involved version of the construction, elements from higher levels are sampled..."

Reads like "in one embodiment of the invention, ..."



