These are very niche use cases. Doing an entire benchmark on that basis without making that clear is misleading. And even in that exact use case you will not see the massive performance difference they are showing, because the data still has to be read from disk by the kernel, whereas in their benchmark the disk isn't touched at all. Sure, in a database that doesn't support this, the data might be read from disk into memory and then written to a buffer, and that's going to be a bit less efficient than letting the kernel do it directly: you have an extra in-memory copy. But that's still far from the performance difference they are showing between a disk read and no disk read at all. So even in this ideal scenario they are massively overstating their performance (which should already be clear from the fact that no disk has 3 terabytes per second of throughput).

Designs like that are absolutely feasible in any database; you just have to do your own chunking. Storing 500MB records in a database is a bad idea anyway, since you never know if you're going to trip over some logic in the database, your application, the kernel, or some interaction between them that's going to load or copy it all unnecessarily. Better to make the chunking explicit.
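To be concrete, here is roughly what I mean by explicit chunking. This is just a sketch against a hypothetical key-value put call (kv_put and CHUNK_SIZE are placeholders, not any particular database's API):

    #include <stdint.h>
    #include <stdio.h>

    #define CHUNK_SIZE (1u << 20)   /* 1 MiB per chunk; tune to the workload */

    /* Hypothetical put call; stands in for whatever key-value store you use. */
    static int kv_put(const char *key, const void *val, size_t vlen)
    {
        (void)val;  /* stub ignores the payload */
        printf("put %s (%zu bytes)\n", key, vlen);
        return 0;
    }

    /* Store a large blob as fixed-size chunks keyed "<name>:<index>", so a
     * reader can fetch just the ranges it needs and nothing ever has to move
     * 500MB around in one piece. */
    static int store_chunked(const char *name, const uint8_t *data, size_t len)
    {
        char key[128];
        size_t nchunks = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;

        for (size_t i = 0; i < nchunks; i++) {
            size_t off = i * CHUNK_SIZE;
            size_t clen = len - off < CHUNK_SIZE ? len - off : CHUNK_SIZE;
            snprintf(key, sizeof key, "%s:%zu", name, i);
            int rc = kv_put(key, data + off, clen);
            if (rc != 0)
                return rc;
        }
        return 0;
    }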
Better to make everything as simple and as easy as possible. App-managed chunking is error-prone, and it forces redundant implementations. Likewise, "an extra in-memory copy" can make or break a system. OpenLDAP slapd using LMDB uses 1/4 the RAM of slapd using BerkeleyDB for a given number of entries. That means you can serve 4x as many entries with LMDB on a given hardware config at full RAM speed vs. BDB. LMDB's efficiency has a direct impact on hardware and electricity costs, as well as software development cost.
We are doing all of the above. Microbenchmarks that show only the performance of the DB engine still serve a purpose. And the difference between the microbenches and the full benches shows you how much overhead is eaten in the rest of the system. E.g., when testing read performance of SQLite3 adapted to LMDB, we measured only a 1-2% improvement, because the majority of SQLite3 read execution time is spent in SQL processing, not in its Btree code.
Nice! Do you know why that benchmark shows almost no read improvement but a massive write improvement? The slides here (http://symas.com/mdb/20120322-UKUUG-MDB.pdf) seem to suggest the opposite? Or am I reading them incorrectly?
As I understand it, that test used something like 32 server threads on a machine with only 4 cores. The UKUUG prezo used 16 server threads on a machine with 16 cores, which allowed all of the reads to run concurrently. The smaller number of CPU cores in the Zimbra test naturally reduces the concurrency advantage of LMDB.
I believe the test runs with 100-byte values, and the key is stored on the same page as the value. Therefore in order to perform a lookup the entire page must still be faulted in, and the key must still be compared, so the benchmark is not fake in that regard. Data is getting pulled off disk, just not in a way that requires multiple copies. Note that the library's tiny code size (and zero-copy design) has another benefit: it can greatly reduce cache pollution, which may again help account for the speed difference.
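For reference, here is roughly what a read looks like against LMDB's C API (error handling trimmed, and `env` is assumed to already be opened via mdb_env_open). mdb_get hands back a pointer into the memory map, so the page still gets faulted in and the key compared, but the value is never copied into a caller-owned buffer:

    #include <lmdb.h>
    #include <stdio.h>
    #include <string.h>

    /* Minimal zero-copy read: mdb_get returns a pointer into the mmap'd file,
     * so the page is faulted in and the key compared, but the value is never
     * copied into a buffer we own. */
    void lookup(MDB_env *env, const char *keystr)
    {
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_val key, data;

        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);

        key.mv_size = strlen(keystr);
        key.mv_data = (void *)keystr;

        if (mdb_get(txn, dbi, &key, &data) == 0) {
            /* data.mv_data points straight at the page in the map and is
             * only valid until the transaction ends. */
            printf("found %zu bytes at %p\n", data.mv_size, data.mv_data);
        }
        mdb_txn_abort(txn);   /* read-only txns are simply aborted */
    }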
> Storing 500MB records in a database is a bad idea anyway, since you never know if you're going to trip over some logic in the database, your application, the kernel, or some interaction between them that's going to load or copy it all unnecessarily. Better to make the chunking explicit.
So basically you're saying there is no benefit because you aren't used to systems that support it.
> Therefore in order to perform a lookup the entire page must still be faulted in, and the key must still be compared, so the benchmark is not fake in that regard. Data is getting pulled off disk, just not in a way that requires multiple copies.
How did you get this information? If that's the case, why did they not actually use the values (e.g. do something trivial like computing an accumulated XOR over the bytes)? That would still show the advantages of the zero-copy design, but not ridiculously so.
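Something like this in the read loop would force the benchmark to touch every byte that comes back (just a sketch; with LMDB the data/size pair would come from an MDB_val, with other stores from whatever buffer their get call returns):

    #include <stddef.h>
    #include <stdint.h>

    /* Fold every byte of a returned value into an accumulator so the read
     * can't be skipped and the benchmark has to touch the whole value, not
     * just locate it.  `data`/`size` would come from the store's get call
     * (an MDB_val's mv_data/mv_size in LMDB's case). */
    uint8_t touch_value(const uint8_t *data, size_t size, uint8_t acc)
    {
        for (size_t i = 0; i < size; i++)
            acc ^= data[i];
        return acc;
    }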
It still seems highly suspicious. For example, the benchmark with 100KB values runs twice as fast as the benchmark with 100-byte values?!
I certainly agree that the zero-copy design can be good, but these benchmarks are vastly overstating it to the point that the benchmarks are pretty useless. With these benchmarks it's impossible to know if that database is actually faster than the others. It could well be faster, but it could also be slower. To know that you need a fair comparison.
Note also that memory-mapped designs have disadvantages as well. For example, AFAIK memory mapping doesn't work asynchronously. You will block your threads, so you can't use lightweight threads or callbacks; you'll have to use heavyweight threads (which may limit concurrency and be expensive in general). Unless you implement your own OS, of course (which, interestingly, some people are doing: http://www.openmirage.org/).
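To illustrate the difference, here is a rough sketch using plain POSIX AIO (error handling omitted; a real server would more likely use an event loop or io_uring, so treat the details as illustrative). Touching a non-resident mapped page stalls the calling thread in a page fault, while an asynchronous read lets that same thread keep doing other work until the data arrives:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* With mmap, the first access to a non-resident page stalls this thread
     * inside a page fault until the disk read completes. */
    void read_mapped(const char *path)
    {
        int fd = open(path, O_RDONLY);
        char *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        printf("first byte: %d\n", p[0]);   /* may block here on a page fault */
        munmap(p, 4096);
        close(fd);
    }

    /* With POSIX AIO, the read is queued and the thread is free to do other
     * work until the data shows up. */
    void read_async(const char *path)
    {
        int fd = open(path, O_RDONLY);
        char buf[4096];
        struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf,
                            .aio_nbytes = sizeof buf, .aio_offset = 0 };

        aio_read(&cb);                      /* returns immediately */
        while (aio_error(&cb) == EINPROGRESS) {
            /* ... do unrelated useful work here ... */
        }
        printf("read %zd bytes\n", aio_return(&cb));
        close(fd);
    }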
> So basically you're saying there is no benefit because you aren't used to systems that support it.
I'm not saying there is no benefit; I'm saying it's a bad idea to rely on it that way for 500MB, because for it to work safely every component has to support it equally well. It's too easy to hit an unexpected problem somewhere and have it blow up by copying 500MB when you only wanted 1KB. But it's still valuable for a database to handle large values well, of course. Anyway, this is beside the main point.
Still nonsense. "AFAIK memory mapping doesn't work asynchronously" - clearly you don't know much. Reader threads in LMDB never block (except for pagefaults to pull data in, which are obviously unavoidable).
> except for pagefaults to pull data in, which are obviously unavoidable
Umm, obviously that's what I mean. And with async I/O instead of memory mapping, you could avoid that. That's why I said memory mapping has disadvantages. Clearly the "nonsense" was invented at your end.
Eh. You're still talking nonsense. If an app requests some data and it needs to be read in from disk, you have to wait. Async I/O or not, that app can't make any progress until its data arrives. Async I/O is only advantageous for writes.
That's incorrect: even for reads, the app can do something else while the disk read is being performed. That's what async I/O is all about.
The app can do whatever else it wants to do that's completely unrelated to the data request. The part of the app that requested the data is stalled until the data arrives. If the data request is the only useful thing going on (e.g., a user clicked "search for Joe's phone number") then there's nothing else productive for the app to do and async I/O is irrelevant.
Yes. That last bit is very rare, especially in server applications where you have multiple clients. In any case you might want to be a little more careful before calling something that you didn't understand nonsense.
Could it be that your test is dominated by Python speed, given that their tests show 15-30 million operations per second but yours shows only half a million? Or is their hardware so much faster?
> Doing an entire benchmark on that basis without making that clear is misleading.
A benchmark which provides source code and test data should never be categorized as "misleading." At worst it puts too much trust in the average reader's understanding of things. But look at it this way: if you littered every page that included a benchmark with all the relevant caveats, it'd be a total mess.
Or as David Simon says, "fuck the average viewer."