Reading through some of their docs, I like their honesty about where RethinkDB is and isn't. In the comparison with MongoDB they say:
"RethinkDB's performance has degraded significantly after the addition of the clustering layer, but we hope we'll be able to restore it over the next several releases."
They could have tried to gloss over that. I like that they weren't afraid to put that out there.
Thanks! Just FYI, this statement is now seriously out of date. We've done an enormous amount of performance work since the comparison document was written, and performance is now back to normal. Once we get some scientific graphs to back up anecdotal evidence, we'll update this part of the doc.
I have been using RethinkDB in production with clustering.
One thing I would really love to see more time put into is integration of a real consensus protocol like Paxos or Raft for properly recovering from failure.
Currently tables will become unwritable after a single machine fails and you need to update the blueprint/semilattice to use a new master manually. This is error prone (due to race conditions in replication of blueprint updates that can cause vector clock conflicts if you update it from more than one place) and generally not all that fun.
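For readers unfamiliar with the term, here is a minimal sketch of how two uncoordinated updates end up as a vector clock conflict. It is a generic textbook vector clock, not RethinkDB's internal implementation; the server names and the `tick`/`compare` helpers are made up for illustration.

```python
from typing import Dict

# A vector clock maps a node name to the number of updates that node has made.
VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of `clock` with `node`'s counter incremented."""
    new_clock = dict(clock)
    new_clock[node] = new_clock.get(node, 0) + 1
    return new_clock


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify `a` relative to `b`: 'equal', 'before', 'after' or 'conflict'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither clock dominates: the updates were concurrent.
    return "conflict"


# Hypothetical scenario: the same config version is edited from two servers
# before either edit has replicated to the other.
base = {"server_a": 1}
edit_from_a = tick(base, "server_a")
edit_from_b = tick(base, "server_b")

print(compare(base, edit_from_a))         # ordered: base happened first
print(compare(edit_from_a, edit_from_b))  # concurrent: needs resolution
```

Neither edit "happened before" the other, so something (a person, or an automatic rule) has to pick a winner.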
Error handling in these cases could also be better, and the initscript it ships with doesn't help much (it would be nicer if it used a symlink hack or something similar to have a different initscript for each db instance).
Currently it's hard to tell why a RethinkDB server crashed or didn't start without inspecting it manually.
That being said, I love using RethinkDB. ReQL is great - in fact it's so good that it's generally justification enough for me to use it for new projects.
If clustering could be given more love it would be my #1 datastore for all projects probably. (See the Jepsen by aphyr series for an idea of what I am looking for here)
I'm happy you brought up clustering. Internally we've been quite frustrated with this part of the product, but until a few months ago we held off the development on it for two reasons: we wanted to collect more information from users on real use cases and behavior, and there were more immediate bottlenecks in the product.
We restarted heavy development on the clustering infrastructure two months ago, and just yesterday I played with the prototype of the first upcoming upgrade. It's a WIP but is absolutely delightful (you can see my tongue in cheek review of it at https://github.com/rethinkdb/rethinkdb/issues/2957).
Here are the parts that are already done and will be shipped soon:
- Vector clock conflicts are now resolved automatically, no more manual conflict resolution
- There is now a ReQL API for clustering that's dramatically better than the current `rethinkdb admin` tool
- Much love has been put into presenting the abstractions to users. Everything is cleaned up and simplified, it's easier to understand and change, and even in advanced cases you won't have to know anything about blueprints/semilattices.
- Really, this is about to get dramatically better. I can't summarize it in a bullet point, we put an enormous amount of effort into this in a thousand different places.
Here's what's coming immediately after that:
- Automatic failover
- Always-on resharding (no more resharding downtime)
(The reason these latter updates are coming after the API overhaul is that they require a lot of simplification/refactoring/redesign internally as well as externally, and we wanted to do it piecemeal.)
Thanks for writing up your feedback and sticking with RethinkDB despite the limitations of clustering 1.x. Multiple people are currently working very hard on this, and things are about to get a lot better.
This makes me very happy to hear. I've been using RethinkDB for a few months now and I love it, but the manual failover has made me a bit uneasy.
Any chance of a clue as to the kind of timeframe for this? Even a very rough idea would be fine as I can appreciate you might not want to (or be able to) commit to anything yet.
I think we'll be able to ship the new clustering API in ~two months (note, it's a huge and a massively delightful change). I'm hoping we'll be able to get failover out two months after that, but it's hard to give precise estimates looking that far out.
If you have any questions about the release, product roadmap, or anything RethinkDB-related, please ask! I'll be around all day to answer questions and incorporate your comments into the development roadmap.
Hi Slava, I just want to say thanks to you and your team for the huge amount of work I know went into having binary data storage make this release.
I followed the relevant issues on GitHub closely, and for what it's worth I think you made the right choice - certainly this fits the way I want to store binary data in the database. It's a better system than even CouchDB attachments, what I'm currently using[0], and so much better than GridFS, a great solution to the wrong problem of how to engineer a storage layer on top of MongoDB, when my actual problem is: "how do I store this avatar image alongside this user as simply as possible."
Knowing how much attention was paid to "getting it right" on this one feature, which I followed particularly closely, significantly bolsters my faith[1] in the engineering quality of the project as a whole.
[0] and what ultimately made me choose Couch over Rethink for my current project
Thanks! There is a lot of pressure to ship quickly, which doesn't necessarily work well for infrastructure systems. It's really wonderful to see people recognize that getting things right is also important.
FYI, the filesystem feature is still on the horizon. The current implementation is good for small files like avatars; the upcoming one will be good for large files like videos. The distributed FS will almost certainly use `r.binary` as a building block, which will result in a neatly modularized design overall.
The feature hasn't been specced out yet, so to be entirely honest I don't know. We'll go through our usual process -- we'll think about this problem hard, talk to users and find specific use cases, and see what others are doing in the field. Most of the time it produces something really pleasant (occasionally it doesn't, and we go back and fix it).
I should have a better idea of what the comparison looks like after we get a little further along.
Do you have (either already or in the works) any plans for supporting some sort of solution for in-browser databases, like either PouchDB or miniMongo?
We've been kicking around some R&D projects around this, but there are no concrete plans to get a production feature out yet.
Here is how the R&D projects have worked thus far. We implemented a RethinkDB protocol compatible server in JavaScript (which is surprisingly easy), so that a lightweight version of the server can run in the browser. If you use the JS driver in the browser with this server, all queries continue to just work.
Behind the scenes the in-browser server can sync up data with the real server, so everything can work while the browser/mobile device goes offline.
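To make the offline idea concrete, here is a toy sketch of a local replica that keeps serving reads and writes while disconnected and replays queued writes on reconnect. It is not the prototype described above and does not use the RethinkDB protocol; `FakeRemote`, `LocalReplica`, and their methods are hypothetical names, and conflict handling is ignored entirely.

```python
from typing import Any, Dict, List, Optional, Tuple

Doc = Dict[str, Any]


class FakeRemote:
    """Stand-in for the real server: just a dict of rows keyed by id."""

    def __init__(self) -> None:
        self.rows: Dict[str, Doc] = {}

    def upsert(self, key: str, doc: Doc) -> None:
        self.rows[key] = doc


class LocalReplica:
    """Local store that queues writes while offline and flushes them later."""

    def __init__(self, remote: FakeRemote) -> None:
        self.remote = remote
        self.rows: Dict[str, Doc] = {}
        self.pending: List[Tuple[str, Doc]] = []
        self.online = True

    def write(self, key: str, doc: Doc) -> None:
        # Always apply locally so reads work with or without a connection.
        self.rows[key] = doc
        if self.online:
            self.remote.upsert(key, doc)
        else:
            self.pending.append((key, doc))

    def read(self, key: str) -> Optional[Doc]:
        return self.rows.get(key)

    def go_offline(self) -> None:
        self.online = False

    def go_online(self) -> None:
        # Replay queued writes in the order they were made.
        self.online = True
        for key, doc in self.pending:
            self.remote.upsert(key, doc)
        self.pending.clear()


remote = FakeRemote()
local = LocalReplica(remote)
local.go_offline()
local.write("user-1", {"name": "ann"})   # served locally, queued for later
local.go_online()                        # queued write reaches the server
```

Even in a toy like this, the hard parts are the ones left out: what happens when the server changed the same row in the meantime, and how the local copy learns about remote changes.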
It's not too hard to build a prototype (and we've seen a few), but getting it to production involves a non-trivial effort; plus we'd have to support the feature indefinitely which would add a significant development cost.
Lots of people have asked for this, so I'd like to find a way to make it happen. @neumino has a version of this in the works right now, so we might just polish it up and release it to the community.
We've been working really hard to keep the surface area of the product minimal so the dev team can stay nimble, which so far allowed us to ship features and stability improvements very quickly. We have a lot of really exciting updates planned for the next few releases, so we'd like to stick to the current model for now. After that, we'll start expanding the dev team and bringing some of the community drivers under the official RethinkDB umbrella. I don't have an ETA for this yet, but I expect this will happen within a year or so.
I've been using https://github.com/bitemyapp/revise (Clojure Driver) for a few months and it's pretty good. It's two versions behind right now, though work is slowly going on to bring it back up to date.
Arghh, sorry. I haven't used Java in years, so all JVM-related languages/drivers are kind of mashed together in a single bucket in my mind. I really ought to play with scala/clojure to unmash them.
Thanks for being a user! If you run into any issues or have feature requests, please post on https://github.com/rethinkdb/rethinkdb/issues (we're planning 1.15 now and can still reshuffle priorities!)
Even with Docker, the upgrade process wasn't that painful. I just used the Dockerfile from here [1] to create a new xxx/rethinkdb:1.14 image, then booted it up and everything worked automagically. I did have to rebuild secondary indexes, but no biggie, it was a simple `rethinkdb index-rebuild -c host:port -r db.table` away (after I upgraded the brew rethinkdb package). Hopefully it will get even better in the future.
FYI, you don't need to rebuild the indexes. You can continue running your app and the old index protocol will work seamlessly. It helps to rebuild when you upgrade in case you start using new functionality (so the indexes get computed correctly), but running an old system would work fine.
EDIT: note that you won't necessarily be able to run with indexes from multiple versions back. We'll almost certainly prune the backwards compatibility code to only a few releases to keep the codebase clean and nimble, and to keep difficult to diagnose bugs to a minimum. But it's still a very convenient feature release-to-release, as you can upgrade, and then rebuild indexes at your convenience.
I love rethink and ReQL and especially the intuitive Python bindings - but I can't wait for geospatial indexing. It's the last thing preventing me from using RethinkDB in prod.
Good news on that front: geospatial support has already passed code review and been merged into the codebase (check out this issue to learn about the implementation and progress: https://github.com/rethinkdb/rethinkdb/issues/2571).
There are a few limitations we'd like to work out (e.g. right now the implementation doesn't support compound geo indexes), but we're well on track to shipping 1.15.
I'm very interested in looking into the code as a sample of real-world distributed system implementation, much more informative than our university's distributed systems course, I guess.
Is there any guide available for the source code, besides src/README?
Unfortunately, there isn't a good guide yet (but hopefully there will be one soon).
Studying RethinkDB source code may not be the best way to study distributed systems, though. There is an enormous gap between a working system and a production quality product, and most of the code in that gap has to do with relatively mundane issues like error checking/error reporting, monitoring tools/APIs, lots of polish, edge case handling, etc.
It's a lot of fun to get into the guts of the system, but it's fairly large, so it's a non-trivial undertaking. If you do decide to do it, we'd love to help you out on IRC (#rethinkdb on freenode), and would appreciate if you documented your experience so we could make the process easier for others in the community!
Wow - people have so much positivity in sharing their experience with RethinkDB. Congrats to the team for building something great, that others find utility in and enjoy working with.
I'm a fan of RethinkDB, however I didn't have a chance to use early releases. I'm glad to see so much positivity. Can you share your thoughts and production workload?
There isn't much of a shift in focus. The query language can do both analytics and realtime queries. We optimize for realtime, but with most of the performance work, optimizations tend to apply to both use cases. Everything is getting better, but the focus is still on realtime.