Hacker News
Riffle: a high-performance write-once key/value storage engine for Clojure (factual.com)
36 points by prospero on Nov 17, 2014 | 7 comments



This is pretty similar to Sparkey [0] and bam [1]. Sparkey also grew out of cdb's limitations; it supports block-level compression like Riffle does, and is optimized for accepting bulk writes. Riffle's linear-time merges, lifted from Sorted String Tables, are a nice alternative to accepting writes at runtime. bam is cool in that it takes a plain separated-values file as input and builds an index file from a minimal perfect hash function over the input file.

[0]: https://github.com/spotify/sparkey [1]: https://github.com/StefanKarpinski/bam
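
For anyone who hasn't looked at one of these formats, the read path they all share is roughly: hash the key into a fixed slot table, pull an (offset, length) entry out of it, then seek into the data file. A rough Java sketch of that shape (the file layout, slot width, and FNV-1a hash here are illustrative, not any of these projects' actual formats):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class HashIndexLookup {
        private final RandomAccessFile index; // slot table: 8-byte offset + 4-byte length per slot
        private final RandomAccessFile data;  // raw key/value records
        private final int numSlots;

        public HashIndexLookup(RandomAccessFile index, RandomAccessFile data, int numSlots) {
            this.index = index;
            this.data = data;
            this.numSlots = numSlots;
        }

        // Returns the record in the slot this key hashes to, or null if the slot
        // is empty. A real format also stores the key so callers can detect
        // collisions; a minimal perfect hash (as in bam) makes that unnecessary.
        public byte[] get(byte[] key) throws IOException {
            int slot = Math.floorMod(hash(key), numSlots);
            index.seek((long) slot * 12);      // each slot entry is 12 bytes
            long offset = index.readLong();
            int length = index.readInt();
            if (length == 0) return null;
            byte[] record = new byte[length];
            data.seek(offset);
            data.readFully(record);
            return record;
        }

        private static int hash(byte[] key) {
            int h = 0x811c9dc5;                // FNV-1a, a stand-in for the real hash
            for (byte b : key) { h ^= (b & 0xff); h *= 0x01000193; }
            return h;
        }
    }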


There are a lot of variants on this design out there; I had seen 'bam' but not the Spotify implementation. An additional constraint we had, which I didn't allude to in the post, was avoiding JNI, which adds some nasty failure modes for remote installations that can be very hard to debug. That meant any C implementation was off-limits for us.

It's unfortunate that using the JVM means that some wheels need to be reinvented, but those are the breaks, I guess.


"While memory-mapping is used for the hashtable, values are read directly from disk, decoupling our I/O throughput from how much memory is available."

Whether you're mmap'ing or using read(), you hit the page cache before you hit disk, and potentially evict its LRU page. Glancing through the source, it doesn't look like they're using actual "direct I/O" (which, to be performant, would need its own caching layer).

That being the case, for lots of tiny reads & writes I'd expect mmap to be superior to read() and write().
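
To make the two paths concrete, here's a minimal JVM sketch fetching the same bytes via an explicit positional read and via a mapped buffer. Both are ultimately served out of the page cache; the mapped path just skips the per-call syscall. The file name and offsets are made up, and the single mapping assumes the file is under 2 GiB:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class ReadVsMmap {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get("values.dat"),
                                                   StandardOpenOption.READ)) {
                // Path 1: explicit positional read. One syscall per call; the
                // kernel copies from the page cache into our buffer.
                ByteBuffer buf = ByteBuffer.allocate(64);
                ch.read(buf, 4096);

                // Path 2: memory-mapped read. No syscall after the initial page
                // fault; we read the page cache's pages directly.
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                byte b = map.get(4096);        // same bytes, different plumbing
                System.out.println(buf.get(0) == b);
            }
        }
    }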


A caching layer for random reads where the dataset is 10x larger than memory isn't hugely useful: with a uniform access pattern, only ~10% of reads can possibly hit cache. If you get a hit, great, but you can't count on it.

For memory mapping to make sense you need to be fetching big chunks of data, whereas read() pulls in roughly a page's worth per call. For the data size and read pattern described in the post, the latter is much more desirable.


It's the opposite. As I said, read() and a read from an mmap'd region both hit the OS page cache first, and both bring in ~4K of data on a miss; read() also pays the overhead of a system call. For tiny reads/writes the usual advice is to use mmap. It's different if you're doing "direct I/O" and bypassing the OS page cache because you have your own caching layer, but I don't think they do.
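
One way to settle the tiny-read question empirically is to time the same random 8-byte reads through both paths. A crude sketch, not a rigorous benchmark (no warmup or cache control, and again it assumes the file fits in one sub-2 GiB mapping):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Random;

    public class TinyReadBench {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get("values.dat"),
                                                   StandardOpenOption.READ)) {
                long size = ch.size();
                long[] offsets = new Random(42).longs(100_000, 0, size - 8).toArray();

                ByteBuffer buf = ByteBuffer.allocate(8);
                long t0 = System.nanoTime();
                for (long off : offsets) { buf.clear(); ch.read(buf, off); }
                long readNs = System.nanoTime() - t0;

                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
                long t1 = System.nanoTime();
                long sink = 0;                 // keep the loop from being optimized away
                for (long off : offsets) sink += map.getLong((int) off);
                long mmapNs = System.nanoTime() - t1;

                System.out.printf("read(): %d ms, mmap: %d ms (sink=%d)%n",
                                  readNs / 1_000_000, mmapNs / 1_000_000, sink);
            }
        }
    }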


Whose advice? Check out RocksDB's front page: http://rocksdb.org. Empirically, what you're saying isn't true in my experience; rather, mmap should be used when there's decent coherency w.r.t. the available memory. Without knowing what you're basing your belief on, I can't really address it.


It looks like they're making some claims about OS-level bottlenecks, specifically in the virtual memory subsystem. That's something I'd like to look into; all I can find is that particular quote, with no explanation of where they think the bottleneck actually lies. The experience of, e.g., the SQLite folks seems to be different.

https://www.sqlite.org/mmap.html
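
For reference, SQLite ships with memory-mapped I/O off by default and gates it behind PRAGMA mmap_size, per that page. A minimal sketch of turning it on from the JVM, assuming the xerial sqlite-jdbc driver is on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SqliteMmap {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:sample.db");
                 Statement st = conn.createStatement()) {
                st.execute("PRAGMA mmap_size = 268435456;"); // map up to 256 MiB of the db file
            }
        }
    }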



