The cache getting out of control and the server getting OOM killed need to be fixed before I'd consider the database production-ready for large data sets.
You are absolutely right. FYI, the rationale for this decision is that people are rapidly adopting Rethink and writing applications on top of it, so we decided to allocate most development resources to nailing down ReQL as quickly as possible so people won't have to rewrite their apps. Some operations issues (like #97) unfortunately had to temporarily take a backseat. I suspect we'll be able to resolve #97 in the 1.8 release.
RethinkDB will be marked production ready when it hits 2.0, though it will be ready for useful work way before that. (A huge number of people are already using it in their daily work and building important software on top of Rethink)
I've generally found the Rethink guys very level-headed and customer-focused. I made a feature request a year or so ago for a feature that was pretty far outside the scope of a DBMS (custom memory/CPU limits for map/reduce operations) and they implemented it in a day or so, even though I was just asking whether it was possible in theory.
They even sent me a small care package with a hand-written note, it was great of them. Marc, the guy who sent me the package, is now one of my regular DOTA2 teammates, and he's great at that too.
Major props to them, I really hope Rethink becomes as amazing as I think it will. I'll give it another spin over the weekend and write it up.
The most inconspicuous sign of RethinkDB taking over the world is what they named this release: Nights of Cabiria. It shows they have the strength to laugh at themselves, at how hard it is to build something like this, and how people have treated them so far.
These are interesting parallels and I suppose much of this might be true subconsciously, but it wasn't premeditated. For each release we just have a random person on our team pick a classic movie. @mglukhovsky picked this one; I don't think he thought it through that far, though :)
Under semver (which isn't the only versioning scheme, but is certainly among the more common of those that have a firm definition of what 1.0 means), 1.0 means "stable API", which doesn't necessarily mean "production ready".
Version numbers don't have one single universally accepted set of defined semantics.
If I remember correctly, the RethinkDB team opted to match their in-house version numbers with public releases. So the 1.x releases would be more comparable to 0.x releases for other projects, and 2.0 would be considered stable and production ready.
Not directly related to the new version, but speaking of upgrading: if you have an existing server, make sure you run the migration scripts beforehand.
I made the mistake of upgrading from 1.4 to 1.6 without migrating beforehand and couldn't find a copy of 1.4 since all the archives were down, and building from source on a VPS just wasn't happening. The Rethink team was amazing in their support, building the old version for me specifically, and if this is indicative of their dedication to user support I can't wait to see this become a big success.
slava @ rethink here. Thanks for the vote of confidence, and sorry for the poor experience. The product is improving very quickly, and we decided to trade off format stability for speed of development until we hit 2.0 (at which point we'll freeze the formats and put more resources into backwards compatibility).
This latest release has actually prompted me to check it out. I had been meaning to but hadn't found an exact use case for it yet; now I'm just going to try to build something with it, though I don't know what yet.
This kind of dedication to it is definitely what makes me want to check it out, you guys are doing some really great work on it.
One aspect of micro-benchmarks that is ignored most of the time is that they reveal the different default settings various systems ship with. And unfortunately, after seeing the results, not many people dig deeper to figure those defaults out.
Indeed. And the default settings reveal something about a system's priorities. If you care about your users' data, then Mongo's default settings should be a giant red flag. The revealed priority is seeming fast.
That was actually one of the explicit design goals of Rethink -- pick defaults such that users never have to wonder about the safety of their data. I know the folks at Riak are also in this camp, so there are definitely NoSQL dbs that do this well.
How about "they should be an invitation to learn more about the system" :D? As far as I can tell, in many cases these defaults have almost never been reviewed by devs because most of the early users already knew what tweaks they needed.
I dunno. Personally I'm a big fan of the "sane default or no default" approach. I think that mongo's former approach is irresponsible engineering (I understand they have fixed it recently).
Well, on a PG 9.1.9 instance with fsync=on, an older Intel i7, and an Intel SSD, my 8-process program inserting a 120-byte TEXT value into a (SERIAL, TEXT) table, with transactions each doing 10 inserts, manages to finish inserting 100,000 groups of 10 "documents" in 37 seconds. So that's roughly 27,000 documents inserted per second (not sure if they are measuring individual documents or groups of 10).
According to iostat, this does 5000 write transactions per second to the disk.
If you are logging throw-away data, you can use an UNLOGGED table. This brings it to 49,000 inserted rows per second.
If you don't care about your data integrity, and set fsync=off, it goes up to 65,000 and seems to be limited by CPU.
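For anyone who wants to reproduce the shape of this test, here's a minimal single-process sketch in Python/psycopg2 (the original was an 8-process C++/libpq program; the table and column names here are my own):

    import psycopg2

    conn = psycopg2.connect("dbname=bench")
    cur = conn.cursor()

    # Plain table; use CREATE UNLOGGED TABLE instead to skip the WAL for throw-away data.
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id SERIAL, body TEXT)")
    conn.commit()

    payload = "x" * 120  # the ~120-byte TEXT value

    for _ in range(100000):      # 100,000 groups
        for _ in range(10):      # 10 "documents" per transaction
            cur.execute("INSERT INTO docs (body) VALUES (%s)", (payload,))
        conn.commit()            # one commit (and, with fsync=on, one WAL flush) per group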
Very cool, thanks for the numbers! A few bits of info on the benchmark in the post.
It uses a single process/thread/connection. Adding more concurrent clients would improve performance, though there is still some scalability work to be done.
Ops/second in the graphs measures number of documents, not groups.
One fundamental aspect of the benchmark is that it uses random keys, which is significantly slower (in any database) than using sequential inserts. I'm not sure how you insert into Postgres with respect to keys.
I suspect postgres has better latency, and since the benchmark in the post is very much latency bound, this would account for the factor of four performance difference. It's something we still need to address. EDIT: 1 million documents / 37 seconds / 8 connections = 3378 ops/second, which is about a factor of four off from 800 ops/second that RethinkDB does on this. (This is obviously back of the napkin math.)
There's lots of work to be done, I'm very much looking forward to publishing authoritative, scientific comparison benchmarks.
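To make the random-vs-sequential key point concrete: by default RethinkDB generates a random UUID primary key for each document, while a SERIAL column in Postgres hands out sequential ids. A rough sketch of the two cases from the Python driver (table names are made up; this isn't the benchmark code itself):

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    # Random keys: the server assigns a random UUID primary key to each document
    r.table("bench").insert([{"v": i} for i in range(1000)]).run(conn)

    # Sequential keys: the client supplies monotonically increasing primary keys
    r.table("bench").insert([{"id": i, "v": i} for i in range(1000)]).run(conn)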
For Postgres I've been using several threads, each with its own connection, all sharing the same snapshot (using the identifier exported by pg_export_snapshot()) - with this the speedup is quite big (a C++ project using libpq).
I expect that if I move to a binary format for inserting it might be even faster. I'm basing this on the fact that my reader is binary, and that made a 3-4x improvement for my data set/computer/etc.
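For reference, a minimal sketch of the snapshot-sharing mechanism being described (pg_export_snapshot / SET TRANSACTION SNAPSHOT), shown with Python/psycopg2 instead of libpq; connection strings and names are placeholders. Note that each worker connection still runs and commits its own transaction; the exported snapshot only synchronizes what the transactions can see:

    import psycopg2
    import psycopg2.extensions as ext

    coordinator = psycopg2.connect("dbname=bench")
    coordinator.set_isolation_level(ext.ISOLATION_LEVEL_REPEATABLE_READ)
    c = coordinator.cursor()
    c.execute("SELECT pg_export_snapshot()")  # opens the exporting transaction
    snapshot_id = c.fetchone()[0]

    worker = psycopg2.connect("dbname=bench")
    worker.set_isolation_level(ext.ISOLATION_LEVEL_REPEATABLE_READ)
    w = worker.cursor()
    w.execute("SET TRANSACTION SNAPSHOT %s", (snapshot_id,))  # must be the first statement
    # ... worker inserts go here ...
    worker.commit()
    # The coordinator's transaction must stay open while workers import the snapshot.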
This is a bit misleading -- the performance difference between 1.6.1 and 1.7 ranges from 2x to >1000x depending on the circumstances (size of documents, durability, whether the client waits on network reply, etc.) This also doesn't account for 10-100x performance improvements on the batch insert algorithm made prior to 1.7.
I suspect that RethinkDB 1.7 would be comparable to Mongo on the insert-performance workload described in this Stack Overflow thread.
It was not deliberately misleading - it's simply the top result when doing a google search on rethinkdb and mongo performance. It's a shame that decent, general case database benchmarks are so hard to find.
Sorry, I meant to say that the benchmark in the release announcement is misleading (since it doesn't address all relevant issues), not the link in your comment.
We're working on getting authoritative benchmarks out, but unfortunately good benchmarks are extremely time consuming (much like any science experiment).
Is there a plan to fix:
"If the machine is acting as a master for any shards, the corresponding tables lose read and write availability (out-of-date reads remain possible as long as there are other replicas of the shards).
...
Currently, RethinkDB does not automatically declare machines dead after a timeout. This can be done by the user, either in the web UI, manually via the command line, or by scripting the command line tools."
I have been working on the Java driver for Rethink for a bit now, and I will tell you the DB seems like an awesome mix of Mongo and Riak. Mapping objects into the DB is super easy, and doing datacenter-aware operations is amazingly easy.
Right now you should have complete functionality, but it's not fully tested and there are some "convenience" methods that need to be added to some of the ReQL classes to make it mimic the official API (for example, row is not a member function of connection yet, so r.update(r.row("foo")) doesn't work yet).
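For comparison, this is roughly the shape of the call in the official Python driver that the Java port is trying to mimic (table and field names here are made up):

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    # r.row refers to the document currently being processed inside the update expression
    r.table("users").update({"score": r.row["score"] + 1}).run(conn)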
FYI-- for those looking to install on Ubuntu, we have a build problem with one of our dependencies for the web UI. We're rebuilding the packages now; this should take about half an hour on Launchpad. I'll update this comment when the new package is available.
Edit: Packages have been uploaded to Launchpad, waiting in queue to build. (12PM PST)
This is a really nice update. Migrating data between releases with a giant Ruby script feels like a hack each time there's an update. Insert speed has been a real annoyance. Fetching multiple keys is really nice (instead of map/filter the entire thing). And expanding pluck() to nested documents makes so much sense (I was worried ReQL would be limited to manipulating top-level documents).
Overall, an exciting release. Going to upgrade and see what insert speed is like on my setup.
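For anyone curious what the two features mentioned above look like, here's a rough sketch using the Python driver (table and field names are made up, so treat it as illustrative rather than copied from the release notes):

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    # Fetch several documents by key in one query instead of map/filter over the whole table
    r.table("posts").get_all("id-1", "id-2", "id-3").run(conn)

    # pluck() on nested documents: keep only author.name and author.email
    r.table("posts").pluck({"author": {"name": True, "email": True}}).run(conn)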
Sorry, I misunderstood your question. In this specific benchmark the bottleneck is the network and disk latency (since the benchmark sends out a batch of writes, waits for a server acknowledgement, which in hard durability mode means waiting on the disk, and then sends out the next batch).
When we use a benchmark that doesn't bottleneck on latency (by adding more concurrent clients, or by using noreply) the ops throughput approaches theoretical IOPS throughput of the SSD.
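To make those two knobs concrete, here's roughly what they look like from the Python driver (table name and documents are placeholders; option names are from the driver as I understand it):

    import rethinkdb as r

    conn = r.connect("localhost", 28015)
    docs = [{"n": i} for i in range(1000)]

    # Hard durability (the default): the insert is acknowledged only after it hits disk
    r.table("bench").insert(docs).run(conn)

    # Soft durability: acknowledged once the write is in memory
    r.table("bench").insert(docs, durability="soft").run(conn)

    # noreply: the client doesn't wait for any acknowledgement before continuing
    r.table("bench").insert(docs).run(conn, noreply=True)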
It's cool to see atomic operations.
Any plans to implement multi-document ACID transactions? Just like all_or_nothing in pre-0.9 CouchDB, where in case of a conflict the update was rejected.
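For context on what's atomic today: as far as I know, single-document updates with deterministic expressions are applied atomically, but there's no multi-document transaction primitive yet. A sketch with the Python driver (names are made up):

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    # This read-modify-write increment happens atomically on the one document...
    r.table("counters").get("page_views").update({"count": r.row["count"] + 1}).run(conn)
    # ...but there is no way to atomically update two documents together.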
I'm the main developer for RethinkDB Cloud. It's pretty rough at the moment. You can expect 1.7 instances to be available over the weekend, along with smaller shared instances and lots of other updates. If there are any questions, or if anyone wants to be a part of the Heroku add-on testing, you can contact me at cam@rethinkdbcloud.com.
While both CouchDB and RethinkDB store JSON, the differences between them are more radical. I cannot post a comparison as extensive as the one with MongoDB, but here are some aspects.
Please keep in mind that this is not an authoritative comparison and it may contain mistakes. Plus, as with many such systems, the aspects covered are in reality not easy to describe in just a few words.
Platforms:
- RethinkDB: Linux, OS X
- CouchDB: wherever the Erlang VM is supported
Data model:
- both JSON
Data access:
- RethinkDB: Unified chainable dynamic query language
- CouchDB: key-value, incremental map/reduce
JavaScript integration:
- RethinkDB: V8 engine; JS expressions are allowed pretty much anywhere in ReQL (see the sketch after this list)
- CouchDB: SpiderMonkey (?); incremental map/reduce, views are JS-based
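A rough illustration of the "chainable query language with embedded JS" point, using the Python driver (table and field names are made up):

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    # Chainable ReQL terms, with an arbitrary JS function evaluated by V8 on the server
    (r.table("users")
      .filter(r.row["age"] > 21)
      .map(r.js("(function (user) { return user.age * 2; })"))
      .run(conn))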
Yeah, I thought about that, but it seems like you can only use them to get the "map" part of map/reduce... no aggregation. Unless I'm missing something.
> Yeah, I thought about that, but it seems like you can only use them to get the "map" part of map/reduce... no aggregation. Unless I'm missing something.
Yes. RethinkDB is really well set up to do this due to the underlying parallelized map/reduce infrastructure. This feature is a matter of scheduling priorities. I don't have an ETA yet, but it will almost certainly get done in the medium-term.
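For reference, plain map/reduce over a table already chains like this in the Python driver (names are made up); the aggregation discussed above would build on the same infrastructure:

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    # map extracts the field on each shard; reduce combines the partial results
    total = (r.table("orders")
              .map(lambda order: order["amount"])
              .reduce(lambda left, right: left + right)
              .run(conn))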
https://github.com/rethinkdb/rethinkdb/issues/97
It's a little disappointing that it has not been resolved for several releases now.
Is there a target milestone for a production-ready release of RethinkDB? Is it 1.8 or 1.9?