Design Review: Key-Value Storage (mozilla.github.io)
305 points by espeed on Nov 8, 2018 | 85 comments



The page says that LMDB "is exceptionally fast for our kinds of load" and then links to an in-memory microbenchmark: http://www.lmdb.tech/bench/inmem/

Aren't they interested in persistence of the key-value data? In my experience, once data is persisted to disk or SSD, LMDB is way slower than the alternatives because it needs to operate in synchronous mode to avoid corruption (effectively flushing to disk after every committed transaction). If operated in the non-default MDB_NOSYNC mode (which is the mode chosen in the above benchmarks), then there is a high probability of being left with an unreadable database file after a crash, thus losing all your data.

It is not fair to compare against other databases running in sync mode, since they might operate just as safely but faster in async mode. For example, SQLite with PRAGMA journal_mode=WAL and PRAGMA synchronous=NORMAL can operate in a semi-asynchronous mode (fsync()ing sporadically) without fear of corruption in case of a crash, because it keeps a WAL journal and is able to properly roll back after a crash. This should be much faster than LMDB's default-and-safe synchronous mode, which msync()s on every value written.
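
For reference, here's a minimal sketch of the SQLite configuration described above (C API; error handling is mostly omitted and the table is just a placeholder):

    #include <sqlite3.h>

    int main(void) {
        sqlite3 *db;
        if (sqlite3_open("app.db", &db) != SQLITE_OK) return 1;

        /* WAL journal + NORMAL sync: fsync() happens at checkpoints rather
           than on every commit, and the WAL allows clean recovery after a
           crash. */
        sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);
        sqlite3_exec(db, "PRAGMA synchronous=NORMAL;", NULL, NULL, NULL);

        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB);",
                     NULL, NULL, NULL);
        sqlite3_exec(db, "INSERT OR REPLACE INTO kv VALUES ('hello', 'world');",
                     NULL, NULL, NULL);

        sqlite3_close(db);
        return 0;
    }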


Coincidentally, I'm in the middle of testing out LMDB as a replacement for SQLite - we've been running fine for a year with journal_mode=WAL and synchronous=NORMAL without issues, and just added LMDB last week, but with MDB_NOSYNC... hmm... so far our real testing showed (for our use case) almost the same results, except at some 95-99th percentiles where SQLite gets much slower (but due to other factors). In synthetic benchmarking (suited to our needs) the difference is minuscule.


If you are using MDB_NOSYNC in production, keep in mind that the trade-off for the boost in performance is that a system crash can lead to an unreadable database.
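
For anyone who hasn't used the flag, a minimal sketch of where MDB_NOSYNC enters the picture (C API; the map size and path are arbitrary, error handling trimmed):

    #include <lmdb.h>

    /* Open an LMDB environment at `path` (an existing directory).
       nosync = 1 trades crash safety for write throughput: a crash can
       leave the file unreadable, exactly as described above. */
    static MDB_env *open_env(const char *path, int nosync) {
        MDB_env *env;
        unsigned int flags = nosync ? MDB_NOSYNC : 0;

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, (size_t)1 << 30);   /* 1 GiB map */
        if (mdb_env_open(env, path, flags, 0664) != 0) {
            mdb_env_close(env);
            return NULL;
        }
        return env;
    }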


Yup, we had SQLite in the same fashion (synchronous=OFF, other non-safe settings, and no WAL either) and had occasional crashes, after which someone had to manually delete the database (in our case it's for a desktop app, a tool, so no server-like requirements, yet we want it to be stable).

We'll remove MDB_NOSYNC then, and see how it goes...


We have only ever compared LMDB in synchronous mode to other DBs in synchronous mode, and LMDB in async mode to other DBs in async mode. Come on, that's too obvious.

And LMDB beats the crap out of SQLite, in any mode. http://www.lmdb.tech/bench/microbench/

Replacing SQLite's Btree engine with LMDB makes the SQLite footprint smaller, faster, and more reliable too. https://github.com/LMDB/sqlightning


> And LMDB beats the crap out of SQLite, in any mode. http://www.lmdb.tech/bench/microbench/

Alright, I went through this page to see why my experience is different. While it mentions that they enabled SQLite's WAL journal, it also mentions that the synchronous writes were performed with PRAGMA synchronous=FULL.

I believe that if they set PRAGMA synchronous=NORMAL, they will get an ACI-reliable database that is way faster than LMDB with write-to-disk workloads.

Most of the measurements on that page are not really useful to me; they either do not persist the data (DB on tmpfs) or do not care about reliability, using dangerous settings. Only a few measurements both persist the data and write safely, and for those the SQLite configuration is sub-optimal.


It's not that simple. What "(a)synchronous mode" means differs greatly from database to database. See my comment above. Is SQLite synchronous or asynchronous if you configure it with journal_mode=WAL and PRAGMA synchronous=NORMAL?

For my purposes as an application developer, I care about comparing databases operating in safe mode, i.e. a system crash should never cause total data loss. In my experience, SQLite's safe mode is many times faster than LMDB's safe mode with a write workload, while LMDB thrashes the disk and achieves only a handful of write transactions per second.


SQLite's safe mode is not comparable to LMDB's; SQLite is vulnerable to silent data loss in a crash.

https://wisdom.cs.wisc.edu/workshops/spring-14/talks/Thanu.p...

LMDB is not.


Isn't this from the paper "All File Systems Are Not Created Equal"? [1] If you search the tables for "sqlite-wal" you will see that it shows zero vulnerabilities.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14...


Ah, the earlier result was for SQLite-rollback, their default mode. IMO any DB should default to its safest mode.


I have been comparing multiple key-value stores, and LMDB has been the simplest and fastest to use, with or without MDB_NOSYNC. It beats almost every KV store in writes, and in reads it is untouchable by anything, by a large margin.


Agreed for the read workloads, I have the same experience. It was designed primarily for an LDAP directory after all, an application that is very read-heavy.


Where on this page does it say that LMDB was run with MDB_NOSYNC? Tests were run with the DB on a tmpfs and then again (lower down on the page) on ext4.


Section 4:

"using a 512GB Samsung 830 SSD and an ext4 partition.

The actual drive characteristics should not matter because the test datasets still fit entirely in RAM and are all using asynchronous writes"


How does this compare to SSTables?


Didn't they review Badger from Dgraph?

https://github.com/dgraph-io/badger

Here is the comparison back in 2017: https://blog.dgraph.io/post/badger-lmdb-boltdb/

It supports concurrent ACID transactions with serializable snapshot isolation (SSI) guarantees.


Hopefully this goes better than WebSQL. WebSQL ran into standardization problems because they started with an existing implementation (SQLite) instead of a spec as their base--meaning anyone making a greenfield implementation would need to implement all the quirks of SQLite to be compatible with other implementations. Sadly it died as a standard and we were stuck with either local storage or IndexedDB.


There is no intent to standardize this or expose it to the web. This is purely for use internal to Firefox. Of course that may end up meaning that it is used as the internal storage for some web feature.


The first sentence on that page says "We propose the standardization of a simple key-value storage...usable from JS, Java, Rust, Swift, and C++"

I assumed this meant an API callable from webasm/js. Did I miss something?


Ah, found my error:

"Not-yet or never goals for this proposal are:

Standardization via a standards body as a web API."

So this is "internal" stuff I guess.


Poor wording perhaps. There are a bunch of places in Firefox that use key-value stores, this proposes standardising on one type of store.


I'm really surprised LSM trees didn't get more commentary. 'All are targeted at server workloads' - sure, but they're also incredibly popular and appear to be as close to the 'one-size-fits-all solution for storage' as we've found.


They're also typically more complicated and require a separate compaction process. They're good for writing lots of data, but not so great for random reads.


Author of Badger here. Our design of separating keys and values has gotten us incredibly fast writes, while still keeping read latencies neck and neck with B+ trees. Worth checking out: https://github.com/dgraph-io/badger


Badger looks nicely done! Did you end up needing to change much in implementing Badger from what was described in the WiscKey paper?


Badger is really good; I'm using it.


This looks very interesting. Do you have any built-in compression, or tips on compressing data?


Nice. Why the choice of Go?


[flagged]


Please don't be a jerk in comments here.

Edit: sadly, it looks like you've been posting quite a few uncivil comments to HN. We ban accounts that do that, so would you please review https://news.ycombinator.com/newsguidelines.html and follow the rules from now on?

The idea here is: if you have a substantive point to make, make it respectfully and thoughtfully; if you don't, please don't comment until you do.


Hey, I know you care a lot about this forum. Sometimes it's easy to forget what this place means to some folks and just do a drive-by shitpost. I hear you.


In case someone involved in this reads this thread: the document does not specify which LMDB version you tested. I suggest you run your tests with the `mdb.master` branch, i.e. the work towards a future 1.0, and not the stable 0.9 branch. The answers to several of your questions will depend on that: with `mdb.master` you can use the VL32 mode, which greatly improves usage on 32-bit platforms, and Windows support is much better.

Regarding NFS: I have recently started testing LMDB on NFS v4 and had no issues so far, but with a single process using the database. AFAIK the warning at http://www.lmdb.tech/doc/ is only for multiple processes using the DB concurrently. I am still not entirely sure there won't be any mmap-related issues, but so far so good.

Regarding "being careful": this is a very important point. The LMDB API does not hold your hand, it lets you do dangerous things which will result in corruption of your database, which you will discover too late. I suggest writing a wrapper around the API to ensure you are using it correctly. (I wish there was a compile flag like LUA_USE_APICHECK [1] for LMDB, which could help detect problems like this, but there isn't.)

[1] https://www.lua.org/manual/5.3/manual.html#4
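
As a rough illustration (my own sketch, not anything from the review): even a tiny wrapper that pairs every transaction begin with exactly one commit or abort, and checks return codes, removes a whole class of misuse:

    #include <lmdb.h>
    #include <string.h>

    /* Hypothetical helper: one put per write transaction, so callers can
       never leave a txn open or close things in the wrong order. */
    static int kv_put(MDB_env *env, MDB_dbi dbi,
                      const char *key, const void *val, size_t len) {
        MDB_txn *txn;
        MDB_val k = { strlen(key), (void *)key };
        MDB_val v = { len, (void *)val };
        int rc;

        rc = mdb_txn_begin(env, NULL, 0, &txn);
        if (rc != 0) return rc;

        rc = mdb_put(txn, dbi, &k, &v, 0);
        if (rc != 0) {
            mdb_txn_abort(txn);      /* never leave the txn dangling */
            return rc;
        }
        return mdb_txn_commit(txn);
    }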


The main reason for that warning about NFS is that people will try it, see that it seems to work, and then get careless and try to use the same DB from two different hosts at once. It's inevitable when you're working on files living on networked filesystems, and it cannot work. NFS doesn't offer any cache coherency guarantees, and the mutexes used for synchronizing writers only work on the host that created them.

Not sure what you're talking about re: the API letting you corrupt your database.


If we're using lmdb for something that's largely or entirely read only, how viable is using it on NFS with (shudder) lock files or something like that?


If it's 100% read-only, you could probably use it safely. Make sure its filesystem on the NFS server is mounted read-only, and obviously all the clients' NFS mounts must also be read-only. As for "largely read only" - if there are any writers at all, all bets are off.

LMDB automatically detects read-only filesystems, and turns off its locking in that case, so it should perform as well as anyone could expect NFS to perform.
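
If it helps, the read-only intent can also be made explicit when opening the environment; a sketch (MDB_RDONLY is a documented flag, the rest is just an assumption about your setup):

    #include <lmdb.h>

    /* Open an existing LMDB environment strictly read-only. If the
       underlying filesystem is itself mounted read-only, LMDB also
       skips its locking, as described above. */
    static MDB_env *open_readonly(const char *path) {
        MDB_env *env;
        mdb_env_create(&env);
        if (mdb_env_open(env, path, MDB_RDONLY, 0444) != 0) {
            mdb_env_close(env);
            return NULL;
        }
        return env;
    }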


Things like closing DBIs, transactions and cursors in the wrong order, for instance.


That won't corrupt the DB. It might corrupt the heap of your current process, but the DB won't be harmed.

C doesn't have destructors. If you free stuff in the wrong order, you lose. That's nothing to do with LMDB's API. Even if LMDB cleaned everything up internally, you'd still have dangling pointers in your app space. No language or API can prevent that.

The only thing you can do in the API that might possibly corrupt the DB is to use MDB_RESERVE and then store a value bigger than the space you reserved. That generally causes a SEGV, and you'll discover very quickly that you have a bug in your code. LMDB will fail fast, every time, and every failure will be a bug in your own code. Makes debugging very quick.
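
For readers who haven't seen it, the MDB_RESERVE pattern being described looks roughly like this (a sketch, assuming an open `txn` and `dbi`); the bug is writing more than `len` bytes into the returned buffer:

    #include <lmdb.h>
    #include <string.h>

    /* Reserve `len` bytes in the DB and fill them in place, saving a copy.
       Writing past `len` bytes into val.mv_data is the "bigger than the
       space you reserved" mistake mentioned above. */
    static int put_reserved(MDB_txn *txn, MDB_dbi dbi,
                            MDB_val *key, const void *src, size_t len) {
        MDB_val val;
        val.mv_size = len;                 /* size must be set before the put */
        int rc = mdb_put(txn, dbi, key, &val, MDB_RESERVE);
        if (rc != 0) return rc;
        memcpy(val.mv_data, src, len);     /* copy exactly len bytes, no more */
        return 0;
    }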


Glad to hear that.

To be honest, the only real corruption issues I had with LMDB in practice were due to me using MDB_NOSYNC, and to broken mmap behavior on some Android devices (on external storage).


I have a layman's question, if somebody could please answer. I have never in my entire life seen a database fail, but DB failures and issues seem to be brought up all the time. Now, I understand that part of this may be the cost associated with them, but I'm sure there's also something that I have no clue about. So my questions are:

1) what kind of problems do databases actually face.

2) what kind of scenarios create those problems.

3) how does a programmer go about testing them?


The easiest scenario to imagine is a hardware failure or power outage. The database was in the middle of doing something and then was interrupted by a hard drive dying or the lights going out. One way to test such a thing is to literally unplug the computer to see how it handles the failure.

So, let's say you have a client/server application... the client is telling the server (database) to write some records to the database. In the middle of the write, you pull the plug. Some questions you'd want to know: what does the database look like when it restarts? Can we read it? What is the current state? Did any of the new data get written? What does the client think was written? If there was an uncommitted database transaction, was the database left unaltered?

It's just as important to test the client in these scenarios. While the server may have crashed, what does the client think happened? Was it waiting for an ACK or "OK" message? Did it get the message? If the update failed, what does the client do in that situation?

Things can get even more complicated if you're thinking of replication across different servers. If one of the servers fails, how does the replication work? Do sessions fail over to other servers? How many servers are required? If there was a corrupted record, did it propagate or was it scrubbed?
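
A crude, scriptable approximation of the "pull the plug" test is to SIGKILL a writer process mid-stream and then see what the reopened store says. Here's a hedged sketch using LMDB (the path is made up, error handling is trimmed, and a kill only simulates a process crash, not a real power cut where drive caches can also lose data):

    #include <lmdb.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Crude crash test: a child commits an increasing counter in a loop,
       the parent SIGKILLs it mid-write, then reopens the DB and checks
       that it is still readable and holds some committed value. */
    int main(void) {
        const char *path = "./crashdb";    /* directory must already exist */
        pid_t pid = fork();

        if (pid == 0) {                    /* child: the "application" */
            MDB_env *env; MDB_txn *txn; MDB_dbi dbi;
            mdb_env_create(&env);
            mdb_env_open(env, path, 0, 0664);
            for (unsigned long i = 0; ; i++) {
                MDB_val k = { 7, "counter" }, v = { sizeof i, &i };
                mdb_txn_begin(env, NULL, 0, &txn);
                mdb_dbi_open(txn, NULL, 0, &dbi);
                mdb_put(txn, dbi, &k, &v, 0);
                mdb_txn_commit(txn);
            }
        }

        sleep(1);
        kill(pid, SIGKILL);                /* "pull the plug" on the writer */
        waitpid(pid, NULL, 0);

        /* parent: does the database still open, and what did it keep? */
        MDB_env *env; MDB_txn *txn; MDB_dbi dbi;
        MDB_val k = { 7, "counter" }, v;
        unsigned long last;
        mdb_env_create(&env);
        if (mdb_env_open(env, path, MDB_RDONLY, 0664) != 0) {
            fprintf(stderr, "database unreadable after crash\n");
            return 1;
        }
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        if (mdb_get(txn, dbi, &k, &v) == 0) {
            memcpy(&last, v.mv_data, sizeof last);
            printf("last committed counter: %lu\n", last);
        }
        mdb_txn_abort(txn);
        mdb_env_close(env);
        return 0;
    }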


Thank you for explaining so well and clearly!

To you and others: are there any other scenarios that happen in production?


Enough material here to scare anyone about databases

https://jepsen.io/talks


Some things I have seen in production:

The disk becomes inoperative during a write; this can be silent, or writes start to return errors. Again, what does the database look like after the problem is solved?

A large operation exceeds the capacity of the server to deal with intermediate state. It runs out of memory or disk, or in some not-great DBs it loses control of some locks and gets deadlocked. Can it recover with only the partial log data?

Disks lie about data being written; what happens if one of these problems happens between the disk saying the data was written and it actually being written?

And, of course, when you move beyond a single server things get way more complex.


You'd be surprised by how often I've seen a database fail in prod simply because it ran out of disk space. In both cases monitoring software was running but misconfigured.


We ship a consumer application with databases in it (multiple). Even the ACID ones fail all the time, maybe 1-3% of our userbase has had a corruption at some point.


Can you talk about the reasons for the failures?


This doc says LevelDB has no transactions; that's not true, it has batch writes. And LevelDB is not implemented in Go; it's implemented in C++.


> LevelDB is not implemented in Go; it's implemented in C++.

There is more than one implementation of LevelDB. This one (https://github.com/syndtr/goleveldb), in Go, is used in major projects such as https://github.com/ethereum/go-ethereum.


Transactions let you perform any series of SQL commands with various expectations around data safety and locking guarantees depending on isolation level.

Batch writes provide a tiny subset of the full possibilities of transactions. While sufficient in many cases, that cannot be generalized to "LevelDB supports transactions".


That's just your pet definition of transaction. That is not universally accepted.


Atomicity and consistency. The classic case is balance transfer. Batch writes only cope with this to the degree that you can model your computation purely in database primitives that can execute server-side as part of the batch. If your computation involves anything else, such as external data or a computation that can't be modelled, then microbatches aren't good enough: you risk lost updates and inconsistent reads.
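
To make the read-modify-write point concrete with the thread's main subject (just an illustration, assuming long-sized balances keyed by account name, error handling trimmed): in a store with real transactions, the reads, the check, and both writes form one isolated, atomic unit, whereas a batch-write API only gives you atomicity for the final two puts.

    #include <lmdb.h>
    #include <string.h>

    /* Balance transfer as one serialized read-modify-write. With only a
       batch-write primitive, the two reads below would not be isolated from
       concurrent writers, so a parallel transfer could be lost. */
    static int transfer(MDB_env *env, const char *from, const char *to, long amount) {
        MDB_txn *txn; MDB_dbi dbi;
        MDB_val kf = { strlen(from), (void *)from };
        MDB_val kt = { strlen(to), (void *)to };
        MDB_val vf, vt;
        long a, b;

        mdb_txn_begin(env, NULL, 0, &txn);       /* write txn: fully serialized */
        mdb_dbi_open(txn, NULL, 0, &dbi);

        if (mdb_get(txn, dbi, &kf, &vf) != 0 ||
            mdb_get(txn, dbi, &kt, &vt) != 0) {
            mdb_txn_abort(txn);
            return -1;
        }
        memcpy(&a, vf.mv_data, sizeof a);
        memcpy(&b, vt.mv_data, sizeof b);
        if (a < amount) { mdb_txn_abort(txn); return -1; }   /* insufficient funds */

        a -= amount; b += amount;
        vf.mv_size = sizeof a; vf.mv_data = &a;
        vt.mv_size = sizeof b; vt.mv_data = &b;
        mdb_put(txn, dbi, &kf, &vf, 0);
        mdb_put(txn, dbi, &kt, &vt, 0);

        return mdb_txn_commit(txn);              /* both updates or neither */
    }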


Universal acceptance is not a valid criterion for human beings at scale.

Schizophrenic people don’t accept facts like “this wall exists”, yet we still reach agreement as a society that a wall exists and don’t try to walk through walls because they exist.

You are of course welcome to hold any belief you wish, but believing something that puts you in direct contradiction to an entire industry significantly raises the bar of proof you must provide in order to convince others to listen to you.

Your reply does not provide that proof, and thus your argument is not persuasive.


I would love to see an example of someone actually knowledgeable about databases who has a different definition for transaction.


Atomicity, consistency, isolation, durability.

https://en.wikipedia.org/wiki/ACID_(computer_science)

But that's just like, the industry's opinion, man.


Maybe I completely misunderstood the parent's description of a transaction, but that was exactly what they were saying.


This is a great example of how to conduct a technical design review.


I see they have some questions (like how good the Windows and Android support is) that are not answered, or only answered internally? I think it would be good to see what they found.


Windows and Android are fully supported, and have been for years. As are iOS, macOS, and all the BSDs.


That's great to know, and I suspected as much. But what I'm saying is that they ask some valid questions, so where are the results of their findings?

Because if I read this, I could conclude that LMDB could have trouble in those areas...


Curious about how this compares to TiKV...


It really sounds like they should take a look at this:

https://github.com/LiveAsynchronousVisualizedArchitecture/si...

"One appealing aspect of LMDB is its relative ease of use from multiple processes, above and beyond its basic capabilities as yet-another-fast-key-value-store."

simdb is only lock free and thread safe. While LMDB is benchmarked at around 10k writes, this should be able to do millions of mixed reads and writes with 4 modern cores. LMDB seems to use a separate lock file to sync multiple threads/processes. The only catch here is that the keys aren't sorted, which doesn't seem to be a requirement of theirs.


LMDB is awesome; I can't tell you how many times I've wished I had access to it from the browser...


How does this compare to Redis?


Redis isn't embeddable; it's a standalone server application, so it's not really in the same space.


Thank you for asking this question. I don't know a lot about databases, but I would have thought redis, as a major key-value store, would have been part of this evaluation. So your answers were helpful.


It actually stores data.


Bring back Mork!

ducks


Mork actually sucked even as a key-value store. It's only decent if your requirements are a) only lookup on a fixed, autoincrement integer ID, b) the only operation you're likely to do is load an entire record or store an entire record at once, and c) parallelism is not in your vocabulary.

Disclaimer: I'm one of the last people to make functional changes to mork.


That's not a "Disclaimer". It's actually the polar opposite... a "Claimer"? ... I think "Source" is the best word to use there.



Hah! :) I've been meaning to call out other misuses of "Disclaimer" vs. "Disclosure" before, but I usually don't because it's just tedious.


NB: NB works for me.


Seems like an instance of reinventing the wheel ...


From the article:

"We propose ‘buying’, not building, the core of such a solution, and wrapping it in idiomatic libraries that we can use on all platforms.

We propose that LMDB is a suitable core (see Appendix A for options considered): it is compact (32KB of object code), well-tested, professionally maintained, reliable, portable, scales well, and is exceptionally fast for our kinds of load. We have engineers at Mozilla with prior experience with LMDB, and their feedback is entirely positive."


How so? The project is building upon an existing key-value store (LMDB) to replace some custom data stores in Firefox using flat text files, JSON files, Kinto, and SQLite.


Curious comment. My reading of it is exactly the opposite of reinventing any wheels.


Why do they make this proposal? What does that mean?


It's a proposal for making use of a new storage engine to store Firefox internal data.


I thought that issues like this would be blocking:

https://stackoverflow.com/questions/44407659/how-to-force-64...

Who knows how it should work on 32-bit systems?

And isn't endianness also a problem? And doesn't SQLite solve both problems by default? And isn't it also possible to configure SQLite to be very fast, if one knows what one is doing?

Btw, I expected that the "notes here" https://docs.google.com/document/d/1bwbpqPb58a0GcEyB4W424pyf... would be conclusions, but the document does not seem to be publicly accessible.


According to @hyc in https://monero.stackexchange.com/questions/2606/are-the-lmdb..., "As of v0.10.0, yes the LMDB files are cross-compatible between 32 and 64bit architectures. They have always been cross-compatible between OSs. They are still byte-order dependent but almost everyone uses little-endian CPUs these days so it's not much of an issue."


So the endianness answer is, from your link: "They are still byte-order dependent but almost everyone uses little-endian CPUs these days so it's not much of an issue." Which just means "we solve the problem by ignoring it completely."

And what about the limitations on 32-bit systems? Doesn't it also need memory address space for the complete size of the database? That's how the database works, if I understood correctly, which makes LMDB effectively unsuitable for 32-bit systems:

https://stackoverflow.com/questions/52862176/lmdb-open-large...

"your user will need to either enable PAE on their system, or upgrade to 64-bit CPU. If neither of these is an option in your application, then you cannot use a memory mapped file larger than your available address space"

In short, it still looks like LMDB is effectively designed for only one endianness and only for 64-bit systems, which is still limiting for many use cases.

Again, it seems that LMDB is "solving" the problem by ignoring it. Which is OK if it fits your use case... But shouldn't Firefox work properly on 32-bit systems? Or did they completely decide that they don't want to target any 32-bit platform anymore?

Moreover, there are use-case scenarios where the memory-mapped-file approach can suffer from unnecessarily reading a page that will be completely overwritten anyway. My conclusion is... if LMDB "works for you", fine, but do your research properly first. I still believe SQLite covers many more use cases and is a safer starting point for many of them, including Firefox's use cases, until I read that they have really decided to narrow those down.


The endianness answer is "we thought about it, and it's a non-problem 99% of the time, and not worth further thought." You can't use a DB concurrently from multiple hosts, so it's unnecessary to support both endiannesses at once. If you need to copy a DB between different endian hosts, export it with mdb_dump and reimport it with mdb_load on the other machine. Don't be an ass.

There are no limitations on 32-bit systems using MDB_VL32 in the 1.0 branch; you're spewing crap without any idea what you're talking about. Whoever's writing on Stack Overflow doesn't know what they're talking about either.

> Moreover, there are use-case scenarios where the memory mapped file approach can suffer from problems of unnecessarily reading the page that will anyway be completely overwritten.

That's only true if you use a writable mmap, and the DB is larger than RAM. The default for LMDB is not to use a writable mmap, so this isn't an issue that affects most LMDB users. It is also already documented, so you choose it at your own risk.


> Again, it seems that LMDB is "solving" the problem by ignoring it. Which is OK if it fits your use case... But shouldn't Firefox work properly on 32-bit systems? Or did they completely decide that they don't want to target any 32-bit platform anymore?

Yes and no, respectively. Yes, Firefox should work properly on 32-bit systems; and no, we didn't completely decide that we don't want to target 32-bit platforms anymore.

Rather, Firefox plans to use LMDB where it fits one of its many use cases for persistent storage (and not use it where it doesn't).

The StackOverflow thread you referenced describes a blockchain program that expects to use "a few GB of lmdb diskspace." Whereas Firefox's use cases for LMDB are sized in the range of a few KiB to a few MiB. Firefox has no plans to start storing a blockchain.

Claimer: I'm the engineer integrating LMDB into Firefox.


The last issue is merely one of performance, not capability.

I’ve successfully used LMDB in 32-bit land, though it takes some effort. I had to scan available virtual memory for the largest contiguous chunk and use that.

Bigger issues are growing the database size and performance in low memory. No transactions can be running when you increase the map size, which is usually hard to coordinate.

Also, in low memory situations, LMDB’s performance suffers tremendously. It can be 100x slower, and commits can take seconds. You won’t run into it unless you are really hammering it, and the developer usually won’t notice because they usually have lots of RAM.


Fwiw, we've benchmarked on Raspberry Pis with slow SDcards. You want to talk about low RAM situations and slow I/Os. The reality is still the same though - every other DB engine is many times slower under the same conditions.

Even when the DB is 5x or 50x larger than RAM... http://www.lmdb.tech/bench/hyperdex/


re: 32bit land, You should look into MDB_VL32 in mdb.master as other posters have suggested.

If any single commit takes multiple seconds, you've probably written too much data in a single transaction. Above about 512MB it starts preemptively flushing intermediate pages out, to keep the in-RAM footprint limited. Some of those intermediate pages will need to be read back in if your transactions are hitting all over the DB. This is where the main slowdown comes.



