RethinkDB 1.7: hot backup, atomic set/get, 10x insert performance improvement (rethinkdb.com)
167 points by coffeemug on July 3, 2013 | 66 comments



The cache getting out of control and the server getting OOM-killed need to be fixed before I'd consider the database production-ready for large sets of data.

https://github.com/rethinkdb/rethinkdb/issues/97

It's a little disappointing that it has not been resolved for several releases now.

Is there a target milestone for a production-ready release of RethinkDB? Is it 1.8 or 1.9?


You are absolutely right. FYI, the rationale for this decision is that people are rapidly adopting Rethink and writing applications on top of it, so we decided to allocate most development resources to nailing down ReQL as quickly as possible so people won't have to rewrite their apps. Some operations issues (like #97) unfortunately had to temporarily take a backseat. I suspect we'll be able to resolve #97 in the 1.8 release.

RethinkDB will be marked production ready when it hits 2.0, though it will be ready for useful work way before that. (A huge number of people are already using it in their daily work and building important software on top of Rethink)


That's a reasonable rationale. Thanks!


I've generally found the Rethink guys very level-headed and customer-focused. I made a feature request a year or so ago for something pretty far outside the scope of a DBMS (custom memory/CPU limits for map/reduce operations) and they implemented it in a day or so, even though I was just asking whether it was possible in theory.

They even sent me a small care package with a hand-written note, it was great of them. Marc, the guy who sent me the package, is now one of my regular DOTA2 teammates, and he's great at that too.

Major props to them, I really hope Rethink becomes as amazing as I think it will. I'll give it another spin over the weekend and write it up.

Congrats on the new release, guys!


The most inconspicuous sign of RethinkDB taking over the world is what they named this release: Nights of Cabiria. It shows they have the strength to laugh at themselves, at how hard it is to build something like this, and how people have treated them so far.

I'd encourage you to learn more about that movie.

https://en.wikipedia.org/wiki/Nights_of_Cabiria


These are interesting parallels and I suppose much of this might be true subconsciously, but it wasn't premeditated. For each release we just have a random person on our team pick a classic movie. @mglukhovsky picked this one; I don't think he thought it through that far, though :)


> Is there a target milestone for a production-ready release of RethinkDB? Is it 1.8 or 1.9?

1.0 is supposed to mean production-ready. Are you suggesting this should be a 0.7 release?


Under semver (which isn't the only versioning scheme, but certainly among the more common of the ones that have a firm definition of what 1.0 means), it means "stable API", which doesn't necessarily mean "production ready".

Version numbers don't have one single universally accepted set of defined semantics.


If I remember correctly, the RethinkDB team opted to match their in-house version numbers with public releases. So the 1.x releases would be more comparable to 0.x releases for other projects, and 2.0 would be considered stable and production ready.


slava @ rethink -- this is exactly right. Sorry for the confusion.


This is similar to node.js versioning, I think.


+1 for the ease of export and import now!

Not directly related to the new version, but if you're upgrading an existing server, make sure you run the migration scripts beforehand.

I made the mistake of upgrading from 1.4 to 1.6 without migrating beforehand, and then couldn't find a copy of 1.4 since all the archives were down, and building from source on a VPS just wasn't happening. The Rethink team was amazing in their support, building the old version for me specifically, and if this is indicative of their dedication to user support I can't wait to see this become a big success.


slava @ rethink here. Thanks for the vote of confidence, and sorry for the poor experience. The product is improving very quickly, and we decided to trade off format stability for speed of development until we hit 2.0 (at which point we'll freeze the formats and put more resources into backwards compatibility).


This latest release has actually prompted me to check it out. I had been meaning to but hadn't found an exact use case for it yet; now I'm just going to try to build something with it, though I don't know what yet.

This kind of dedication to it is definitely what makes me want to check it out, you guys are doing some really great work on it.


10x insert performance improvement seems very small considering this simple benchmark someone did:

http://stackoverflow.com/questions/15151554/comparing-mongod...

  MongoDB: 0m0.618s
  RethinkDB: 2m2.502s
Although as stated in that post, this is likely because of the different fsync() policies between the two databases.


    /dev/null: 0m0.001s
/dev/null is clearly the best database.

(If your performance numbers are too good to be true, they might not be true.)


One aspect of micro-benchmarks that is ignored most of the time is that they reveal the different default settings various systems ship with. Unfortunately, after seeing the results, not many people dig deeper to figure those defaults out.


Indeed. And the default settings reveal something about a system's priorities. If you care about your users' data, then Mongo's default settings should be a giant red flag. The revealed priority is seeming fast.


In the NoSQL world, any DB's default settings should be a giant red flag. Don't get me started about HBase... I lost 2 hours of data this way. :(


That was actually one of the explicit design goals of Rethink -- pick defaults such that users never have to wonder about the safety of their data. I know the folks at Riak are also in this camp, so there are definitely NoSQL dbs that do this well.


I'm not sure Riak is a good example, as it's another extremely slow database.

There's also the angle that if it offers no performance benefit, perhaps a classic relational database would do?

I'm curious to hear your thoughts about it.


As someone who uses HBase in production, I'd be interested in hearing which settings you're talking about.


Honestly, I've been very positively impressed with CouchDB so far.


How about "they should be an invitation to learn more about the system" :D? As far as I can tell, in many cases these defaults have almost never been reviewed by devs because most of the early users already knew what tweaks they needed.


I dunno. Personally I'm a big fan of the "sane default or no default" approach. I think that mongo's former approach is irresponsible engineering (I understand they have fixed it recently).


Well, on a PG 9.1.9 instance with fsync=on, running on an older Intel i7 with an Intel SSD, my 8-process program inserting a 120-byte TEXT value into a (SERIAL, TEXT) table, with transactions each doing 10 inserts, manages to insert 100,000 groups of 10 "documents" in 37 seconds. So that's roughly 27,000 documents inserted per second (not sure if they are measuring individual documents or groups of 10).

According to iostat, this does 5000 write transactions per second to the disk.

If you are logging throw-away data, you can use an UNLOGGED table. This brings it to 49,000 inserted rows per second.

If you don't care about your data integrity, and set fsync=off, it goes up to 65,000 and seems to be limited by CPU.
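
For concreteness, here is roughly what that setup looks like as a minimal Python/psycopg2 sketch; the table name, connection string, and the split of work across processes are assumptions, not my actual benchmark code:

  # Hypothetical reconstruction of the setup described above (not the actual
  # benchmark code): 8 worker processes, each repeatedly committing
  # transactions of 10 small inserts. Table name and DSN are made up.
  import multiprocessing
  import psycopg2

  DSN = "dbname=bench"          # assumed connection string
  GROUPS_PER_WORKER = 12500     # 8 workers x 12,500 groups = 100,000 groups

  def worker(_):
      conn = psycopg2.connect(DSN)
      cur = conn.cursor()
      payload = "x" * 120       # ~120-byte TEXT value
      for _ in range(GROUPS_PER_WORKER):
          for _ in range(10):   # 10 inserts per transaction
              cur.execute("INSERT INTO docs (body) VALUES (%s)", (payload,))
          conn.commit()         # one fsync'd commit per group of 10
      conn.close()

  if __name__ == "__main__":
      # Assumes the table was created beforehand:
      #   CREATE TABLE docs (id SERIAL, body TEXT);
      with multiprocessing.Pool(8) as pool:
          pool.map(worker, range(8))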


Very cool, thanks for the numbers! A few bits of info on the benchmark in the post.

It uses a single process/thread/connection. Adding more concurrent clients would improve performance, though there is still some scalability work to be done.

Ops/second in the graphs measures number of documents, not groups.

One fundamental aspect of the benchmark is that it uses random keys, which is significantly slower (in any database) than using sequential inserts. I'm not sure how you insert into Postgres with respect to keys.

I suspect Postgres has better latency, and since the benchmark in the post is very much latency-bound, this would account for the factor-of-four performance difference. It's something we still need to address. EDIT: 1 million documents / 37 seconds / 8 connections = 3378 ops/second per connection, which is about a factor of four off from the 800 ops/second that RethinkDB does here. (This is obviously back-of-the-napkin math.)

There's lots of work to be done, I'm very much looking forward to publishing authoritative, scientific comparison benchmarks.
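
To make the shape of the benchmark concrete, here's a minimal sketch of such a latency-bound insert loop; this assumes the Python driver and made-up table/batch sizes, and isn't the actual benchmark code:

  # One connection: send a batch of random-key documents, wait for the
  # acknowledgement, repeat. Table name, batch size, and key format are
  # illustrative assumptions.
  import os
  import rethinkdb as r

  conn = r.connect("localhost", 28015)
  table = r.db("test").table("stress")

  for _ in range(1000):
      batch = [{"id": os.urandom(8).hex(), "payload": "x" * 100}
               for _ in range(100)]
      # run() blocks until the server acknowledges the write, which with hard
      # durability means waiting on the disk as well as the network round trip.
      table.insert(batch).run(conn)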


For Postgres I've been using several threads, each with its own connection, all going through the same transaction (using the identifier exported by pg_export_snapshot()) - with this the speedup is quite big (C++ project using libpq).

I expect that if I move to the binary format for inserting it might be even faster. I'm basing this on the fact that my reader is binary, and that made a 3-4x improvement for my data set/computer/etc.


According to that page, the Mongo test was using unacknowledged writes, so not exactly a useful comparison
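
For anyone curious what that difference looks like from the client side, here's a small pymongo sketch (hypothetical collection names, not the benchmark's code):

  # Unacknowledged vs. acknowledged writes in MongoDB.
  from pymongo import MongoClient

  doc = {"payload": "x" * 100}

  # w=0: fire-and-forget; the client never waits for the server to confirm
  # the write, which is what makes such benchmark numbers look so fast.
  unacked = MongoClient("localhost", 27017, w=0)
  unacked.test.bench.insert_one(doc.copy())

  # w=1: the call blocks until the primary acknowledges the write.
  acked = MongoClient("localhost", 27017, w=1)
  acked.test.bench.insert_one(doc.copy())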


This is a bit misleading -- the performance difference between 1.6.1 and 1.7 ranges from 2x to >1000x depending on the circumstances (size of documents, durability, whether the client waits on network reply, etc.) This also doesn't account for 10-100x performance improvements on the batch insert algorithm made prior to 1.7.

I suspect that RethinkDB 1.7 would be comparable to Mongo on the insert performance workload described in that Stack Overflow thread.


It was not deliberately misleading - it's simply the top result when doing a Google search on RethinkDB and Mongo performance. It's a shame that decent, general-case database benchmarks are so hard to find.


Sorry, I meant to say that the benchmark in the release announcement is misleading (since it doesn't address all relevant issues), not the link in your comment.

We're working on getting authoritative benchmarks out, but unfortunately good benchmarks are extremely time consuming (much like any science experiment).


It would be great to see a comparison with the Tokutek benchmark on MongoDB / the version with fractal tree indexes: http://www.tokutek.com/2013/05/sysbench-benchmark-for-mongod...


Ok, now try doing 1000 inserts at once on Mongo. OH WAIT, global write lock.


Is there a plan to fix: "If the machine is acting as a master for any shards, the corresponding tables lose read and write availability (out-of-date reads remain possible as long as there are other replicas of the shards). ... Currently, RethinkDB does not automatically declare machines dead after a timeout. This can be done by the user, either in the web UI, manually via the command line, or by scripting the command line tools."

http://www.rethinkdb.com/docs/advanced-faq/#what-happens-whe...


I have been working on the Java driver for Rethink for a bit now, and I will tell you the DB seems like an awesome mix of Mongo and Riak. Mapping objects into the DB is super easy, and doing datacenter-aware operations is amazingly simple.


What's the state of that driver look like? It's literally the only thing holding me back from moving over today.


Right now you should have complete functionality, but it's not fully tested and there are some "convenience" methods that need to be added to some of the ReQL classes to make it mimic the official API (for example, row is not a member function of connection yet, so r.update(r.row("foo")) doesn't work yet).


dkhenry, still looking forward to the day I see it on Github and get to play with it! Can't wait to try it out.



That's phenomenal!


FYI-- for those looking to install on Ubuntu, we have a build problem with one of our dependencies for the web UI. We're rebuilding the packages now; this should take about half an hour on Launchpad. I'll update this comment when the new package is available.

Edit: Packages have been uploaded to Launchpad, waiting in queue to build. (12PM PST)


@atnnn confirms that Launchpad builds are now working and available-- Ubuntu packages are live.


This is a really nice update. Migrating data between releases with a giant Ruby script feels like a hack each time there's an update. Insert speed has been a real annoyance. Fetching multiple keys is really nice (instead of map/filter the entire thing). And expanding pluck() to nested documents makes so much sense (I was worried ReQL would be limited to manipulating top-level documents).

Overall, an exciting release. Going to upgrade and see what insert speed is like on my setup.
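
For anyone who hasn't tried those two features yet, this is roughly what they look like from the Python driver; the table and field names are made up:

  # Fetching several documents by key in one call, and plucking nested fields.
  import rethinkdb as r

  conn = r.connect("localhost", 28015)

  # Multiple keys at once, instead of mapping/filtering over the whole table.
  users = r.table("users").get_all("alice", "bob", "carol").run(conn)

  # pluck() now reaches into nested documents.
  contacts = r.table("users").pluck({"contact": ["phone", "email"]}).run(conn)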


For those interested, my insert time for ~250 records (~9.2MB of JSON) went from 340 seconds to 20 seconds.

Unfortunately when migrating from 1.6.2 to 1.7.1 I lost a table (not sure how) and all secondary indexes :(


Scala driver for anyone interested: https://github.com/kclay/rethink-scala. It's been fun working with Rethink (it's not updated for the 1.7 changes yet).


Out of curiosity, where is the insert bottleneck now?


We have to investigate to know for sure, but if I had to guess it's probably in the storage engine/disk IO.


Surely your SSD has more than 800 IOPS though. Have you done any profiling?


Sorry, I misunderstood your question. In this specific benchmark the bottleneck is in network and disk latency: the benchmark sends out a batch of writes, waits for a server acknowledgement (which in hard durability mode means waiting on disk), and then sends out the next batch.

When we use a benchmark that doesn't bottleneck on latency (by adding more concurrent clients, or by using noreply), the ops throughput approaches the theoretical IOPS throughput of the SSD.
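
As a rough illustration (Python driver syntax, made-up table and sizes), taking latency out of the picture looks something like this:

  # Fire off batches without waiting for per-batch acknowledgements, so
  # throughput is no longer bound by round-trip latency.
  import rethinkdb as r

  conn = r.connect("localhost", 28015)
  table = r.db("test").table("stress")

  batch = [{"payload": "x" * 100} for _ in range(100)]
  for _ in range(1000):
      table.insert(batch).run(conn, noreply=True)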


Does anyone know the status of PHP drivers for RethinkDB? I'm surprised there aren't official drivers out yet.


There is a pretty well maintained community supported PHP driver -- https://github.com/danielmewes/php-rql.


It's cool to see atomic operations. Any plans to implement multi-document ACID transactions? Just like all_or_nothing in pre-0.9 CouchDB, where an update was rejected in case of conflict.
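
To be clear about what I mean: as I understand it, a single-document read-modify-write like the one below (Python driver, hypothetical table) is already applied atomically; what I'm asking about is the multi-document equivalent:

  # A single-document atomic update via an expression; the question above is
  # about updates spanning multiple documents.
  import rethinkdb as r

  conn = r.connect("localhost", 28015)
  r.table("counters").get("page_hits").update(
      lambda doc: {"value": doc["value"] + 1}
  ).run(conn)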


Are you guys aware of any rethinkdb cloud hosts that are starting up?


Check out https://www.rethinkdbcloud.com. (I don't know about their current state, but it seems like an interesting service)


I'm the main developer for RethinkDB Cloud. It's pretty rough at the moment. You can expect 1.7 instances to be available over the weekend, along with smaller shared instances and lots of other updates. If there are any questions, or if anyone wants to be a part of the Heroku add-on testing, you can contact me at cam@rethinkdbcloud.com.


Can anyone compare/contrast RethinkDB to CouchDB and MongoDB?


You can check out the technical comparison (http://rethinkdb.com/docs/comparisons/mongodb/), and a more biased comparison overview (http://rethinkdb.com/blog/mongodb-biased-comparison/). Hope this helps!


While both CouchDB and RethinkDB store JSON, the differences between them are more radical. I can't post a comparison as extensive as the MongoDB one, but here are some aspects.

Please keep in mind that this is not an authoritative comparison and it may contain mistakes. Plus, as with many such systems, the aspects covered aren't really easy to describe in just a few words.

Platforms:

- RethinkDB: Linux, OS X

- CouchDB: anywhere the Erlang VM is supported

Data model:

- both JSON

Data access:

- RethinkDB: Unified chainable dynamic query language

- CouchDB: key-value, incremental map/reduce

Javascript integration:

- RethinkDB: V8 engine; JS expressions are allowed pretty much anywhere in ReQL

- CouchDB: SpiderMonkey (?); incremental map/reduce, views are JS-based

Access protocol:

- RethinkDB: Protocol Buffers

- CouchDB: HTTP

Indexing:

- RethinkDB: Multiple types of indexes (primary key, compound, secondary, arbitrarily computed)

- CouchDB: incremental indexes based on view functions

Sharding:

- RethinkDB: Guided range-based sharding (supervised/guided/advised/trained)

- CouchDB: -

Replication:

- RethinkDB: sync and async replication

- CouchDB: bi-directional replication can be set up between multiple CouchDB servers

Multi-datacenter:

- RethinkDB: Multiple DC support with per-datacenter replication and write acknowledgements

- CouchDB: (?)

MapReduce:

- RethinkDB: Multiple MapReduce functions executing ReQL or Javascript operations

- CouchDB: views are map/reduce but they need to be pre-defined

Consistency model:

- RethinkDB: Immediate/strong consistency with support for out of date reads

- CouchDB: http://guide.couchdb.org/draft/consistency.html

Atomicity:

- both document level

Durability:

- both durable

Storage engine:

- RethinkDB: Log-structured B-tree serialization with incremental, fully concurrent garbage compactor

- CouchDB: B-tree

Query distribution engine:

- RethinkDB: Transparent routing, distributed and parallelized

- CouchDB: none

Caching engine:

- RethinkDB: Custom per-table configurable B-tree aware caching

- CouchDB: none (?)


Are there any plans for Couch-style incrementally-computed aggregates/views in RethinkDB?


Considering RethinkDB's secondary indexes can be defined around pretty complex ReQL expressions [1], you can already get some of that.

[1] http://rethinkdb.com/docs/pragmatic-faq/#how-do-i-take-advan...
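
For example, an index over a computed expression looks roughly like this in the Python driver (hypothetical table and field names):

  # A secondary index defined by an arbitrary ReQL expression, then queried
  # through get_all.
  import rethinkdb as r

  conn = r.connect("localhost", 28015)

  r.table("users").index_create(
      "full_name",
      lambda user: user["last_name"] + " " + user["first_name"]
  ).run(conn)

  rows = r.table("users").get_all("Smith John", index="full_name").run(conn)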


Yeah, I thought about that, but it seems like you can only use them to get the "map" part of map/reduce... no aggregation. Unless I'm missing something.


> Yeah, I thought about that, but it seems like you can only use them to get the "map" part of map/reduce... no aggregation. Unless I'm missing something.

I think that's correct.


Yes. RethinkDB is really well set up to do this due to the underlying parallelized map/reduce infrastructure. This feature is a matter of scheduling priorities. I don't have an ETA yet, but it will almost certainly get done in the medium-term.


Awesome. If there's a Github issue about it I'd be interested to follow it.




