My company uses MongoDB. Our biggest pain points are:
1. MongoDB has massive storage overhead per field due to the BSON format. Even if you use single-character field names, you're still looking at space wasted on null terminators, and fixed-length 32-bit int32s bloat your storage further (a quick byte-count example follows this comment). We work around this by serializing our objects as binary blobs into the DB, and only using extra fields when we need an index.
2. In Mongo, the entire DB eventually gets paged into memory, and performance depends on the OS paging system, which murders performance once the data no longer fits in RAM. Fine for a small DB; for a humongous one, not so much.
3. #1 and #2 force sharding, and MongoDB requires deploying a "config cluster" - three additional instances just to manage the sharding metadata (annoying that the data nodes themselves cannot manage this, and expensive from an ops/cost standpoint).
What I would like to know is:
1. What is the storage overhead per field of a document in RethinkDB? If it's greater than 1 byte, I'm wary.
2. Where is the .Net driver?
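To put a number on the per-field overhead mentioned above, here is a quick sketch using PyMongo's bson module (this assumes a reasonably recent PyMongo is installed and is not from the original poster's setup); it shows what a single one-character int field costs in BSON.

    # Rough illustration of BSON per-field overhead (assumes `pip install pymongo`).
    # A BSON document is a 4-byte length prefix, the elements, and a trailing null.
    # Each element is a 1-byte type tag, the null-terminated field name, and the value.
    import bson

    doc = bson.encode({"a": 1})
    print(len(doc))  # 12 bytes: 4 (length) + 1 (type) + 2 ("a\0") + 4 (int32) + 1 (trailing null)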
1. In the coming release we'll be storing documents on disk via protocol buffers, which, unlike BSON, have extremely low per-field overhead. A few releases after that we'll be able to do much better via compression of attribute-name information (though this feature isn't specced yet).
2. No ETA yet, but we're about to publish an updated, better-documented, better-architected client-driver-to-server API spec, so we'll be seeing many more drivers soon.
If you use proto-bufs, it means you already have a system for internal auto-schematization. Why not pack all the fields together and use a bit-vector header to signify which fields are present and which fields have default values? I'd LOVE to see a document DB with ~1 bit overhead per field.
Yes, that's pretty much what we're going to do. It's a bit hard to guarantee everything in a fully concurrent, sharded environment so it'll take a bit of time, but that's basically the plan.
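Here is a minimal sketch of the presence-bitmap idea floated above. It is purely illustrative (the schema, field names, and encoding are made up, and this is not RethinkDB's actual on-disk format): with a known schema, one header bit per field records whether the field is stored or takes its default, so field names never hit disk and the per-field overhead is roughly one bit.

    # Hypothetical presence-bitmap encoding -- NOT RethinkDB's real format.
    import struct

    SCHEMA = [("id", 0), ("age", 0), ("score", 0)]  # (name, default); all int32 here

    def pack(doc):
        bits, values = 0, []
        for i, (name, default) in enumerate(SCHEMA):
            if doc.get(name, default) != default:
                bits |= 1 << i                 # mark the field as present
                values.append(doc[name])
        return struct.pack("<B%di" % len(values), bits, *values)

    def unpack(buf):
        bits = buf[0]
        values = iter(struct.unpack("<%di" % bin(bits).count("1"), buf[1:]))
        return {name: (next(values) if bits & (1 << i) else default)
                for i, (name, default) in enumerate(SCHEMA)}

    blob = pack({"id": 7, "score": 42})
    print(len(blob), unpack(blob))  # 9 bytes; {'id': 7, 'age': 0, 'score': 42}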
10gen has been thinking about compression, but nothing specific has happened yet (https://jira.mongodb.org/browse/SERVER-164). ZFS + compression is interesting, but not 'production' quality if you're using Linux, and the last time I tried to get MongoDB running on Solaris I gave up...
The issue has been open for over two and a half years, is one of the most highly voted issues, and has yet to even reach active engineering status.
Agree with you that compression is just a workaround for the awful BSON format.
I started using RethinkDB in one of my projects and am looking for excuses to use it in more of them. So far things have been great and honestly my impression is that RethinkDB doesn't get nearly the hype it deserves.
I used Mongo before, and it's a fine DB; I don't think I would be sad to use it. However, Rethink really does so many things better.
Again, I just started using it and things are really good; I haven't run into any obvious limitations or annoyances.
There are several features I really like. For example, the web admin is really well done, and it's easy and obvious how to create a cluster. There are a lot of small things that let me jumpstart my development faster: I can run queries in the admin to try them out, and I get the data back to see what things will look like.
The only thing I am somewhat missing is 'brew install rethinkdb'
Thank you :). Again, it's not really a serious gripe, as the regular installer does the job just fine, but I tend to install all my apps that way and it helps.
+1 for sure for this one
btw, when I was installing it, I tried that even before checking the homepage :)
I would like to add (rather than edit the original post one more time) that another thing I like is the query language. I remember how weird it initially felt doing simple things in Mongo; Rethink's language is very simple, and queries make sense and are easy to read.
It does indeed look very much like MongoDB, but made by people that actually know what they're doing. It's refreshing to see good database design for a change.
MongoDB IS durable now by default, has a third-party MVCC implementation (MongoMVCC), and has pretty decent admin tools.
And this idea that joins is a requirement for a "serious" database makes absolutely no sense. Database level joins are toxic for scalability and IMHO should always be done in the application layer.
Toxic seems like a strong word to describe standard relational database functionality. Are you seriously recommending that join functionality always be done in the application layer? If so, are you speaking from specific experience, and can you elaborate on your reasoning?
I've seen too many poor re-implementations of relational database functionality in the application layer to ever recommend it as a standard starting point. Doesn't the concept of not prematurely optimizing apply here? Solve the scalability problem when you need to. That may mean moving some join functionality into the application layer, but the solution to any given scalability problem depends on the specifics of the problem. Just throwing out database joins as a rule seems drastic.
Mongo in its most durable mode (which btw isn't what, say, Postgres would call durable) is really slow. Why even bother with it anymore?
First-party MVCC is the only kind that matters. It affects vital things like backups, analytical queries, and transactions.
Joins are extremely useful. If a database does the sharding, it is almost always better for it to do the joins as well. Performance can be good with the right model, and Mongo is slow anyway.
So we are in agreement then. MongoDB IS durable, but it will be slower operating that way. Hardly a surprise there. And I still have to disagree about the joins, but hey, agree to disagree.
As for MongoDB performance, well, making a blanket statement is pretty silly. On a previous project I had queries that were upwards of 40x faster in MongoDB than MySQL. Why? Because MongoDB allows the ability to embed documents within other documents, to the point where I could make a single query with zero joins to fetch 20 entities' worth of data.
Every database is optimal for different use cases.
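For what it's worth, the embedding the parent comment describes looks roughly like this (a made-up blog schema against a local MongoDB with a current PyMongo; names and fields are illustrative, not theirs): related data lives inside the post document, so one lookup replaces several relational joins.

    # Illustrative denormalized document: author and comments are embedded,
    # so a single find_one returns everything a page render needs.
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["blog"]

    db.posts.insert_one({
        "_id": 1,
        "title": "Hello",
        "author": {"name": "Ann", "bio": "..."},            # embedded: no join to a users table
        "comments": [{"who": "Bob", "text": "Nice post"}],  # embedded: no join to a comments table
        "tags": ["databases", "nosql"],
    })

    post = db.posts.find_one({"_id": 1})  # one round trip, zero joins
    print(post["author"]["name"], len(post["comments"]))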
MongoDB has been durable for a while with journalling. They've only just enabled safe mode (i.e. synchronous acknowledgement) for the clients by default, but that is something different from being durable.
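The distinction being drawn here (acknowledged writes vs. journaled writes) maps onto write concerns. A sketch using the current PyMongo URI options follows; the 2012-era drivers exposed the same idea through different knobs (e.g. safe=True), so treat the exact spellings as illustrative.

    # "Safe mode" vs. durability as write concerns (current PyMongo URI options):
    #   w=0             fire-and-forget, no acknowledgement at all
    #   w=1             acknowledged by the primary ("safe mode"), not yet on disk
    #   w=1 + journal   acknowledged only once the write hits the journal (durable)
    from pymongo import MongoClient

    fire_and_forget = MongoClient("mongodb://localhost:27017/?w=0")
    acknowledged = MongoClient("mongodb://localhost:27017/?w=1")
    journaled = MongoClient("mongodb://localhost:27017/?w=1&journal=true")

    journaled.test.events.insert_one({"kind": "durable write"})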
> Database level joins are toxic for scalability and IMHO should always be done in the application layer.
Not having database level joins is toxic for scalability for so many reasons.
MongoDB reminds me of MySQL: The Early Years, when every ignorant design decision and piece of missing functionality was somehow actually a benefit. Then it gained them, and most people nervously smiled and moved on.
Most of the posts I've seen about RethinkDB focus on "hey, we're a better NoSQL solution than MongoDB." That could be true, but so far I see it mostly coming from RethinkDB themselves, or people who like the design in theory.
However, does anyone have any practical real-world experience using it? It's not production ready (from what I gather), but has anybody actually used it for real world stuff?
For my own part, I tried it out, and got stuck trying to implement a many-to-many style join. I did some searching, and it looks like that is not really possible at this point. Not a big deal, but it might be handy to have some example SQL-to-RethinkDB queries, just to help us newbies figure out the ropes.
> However, does anyone have any practical real-world experience using it? It's not production ready (from what I gather), but has anybody actually used it for real world stuff?
People have told us they are experimenting with it for real apps, they've sent extensive feedback (most of which can be seen on GitHub), and some have started to build libraries for RethinkDB. As with anything young and open source, though, it's difficult to tell with certainty how many projects are using a tool and what stage they're at.
> it might be handy to have some example SQL-to-RethinkDB queries, just to help us newbies figure out the ropes.
Hi, slava @ rethink here. People have been using Rethink for lots of projects, but there are still some showstoppers we have to work out (better docs notwithstanding) -- remember, the product has been in the wild for less than 90 days.
What did you try to do with a many-to-many join? We could help you with writing the query, and could add syntax sugar to the language to make it easier if it makes sense.
> "remember, the product has been in the wild for less than 90 days."
I hear you, and, FWIW, I'm excited about Rethink. To rephrase my question/observation: your article clearly lays out why you think it is better than MongoDB, using some quotes from people who agree with you. However, without some real-world data, it is still an argument rooted in theory. I like theory, but I also like to take real-world data to my bosses. Do you have any stats/examples that actually compare and demonstrate the performance? (I understand that wasn't the purpose of your article, just asking as a follow-up.)
Regarding the many-to-many joins: I was just playing around with a contrived example: "a blog post has and belongs to many categories." Mainly I was just curious how to do it; I didn't _need_ it for anything. But I couldn't figure out how to write it with the query language. I was using the Python DSL, FWIW.
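In case it helps other newbies: with a junction table, the "posts have and belong to many categories" example can be written along these lines in the Python driver. This is a sketch against the current ReQL API (eq_join/zip); the table and field names are made up, and the driver available when this thread was written may not have supported the same calls or import style.

    # Hypothetical many-to-many join via a junction table.
    # post_categories rows look like {"post_id": ..., "category_id": ...}.
    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    categories_for_post = (
        r.table("post_categories")
         .filter({"post_id": 1})
         .eq_join("category_id", r.table("categories"))  # join on categories' primary key
         .zip()                                           # merge junction row and category row
         .run(conn)
    )
    print(list(categories_for_post))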
Riak is NOT operations-oriented. It's nearly impossible to manage operationally without dedicated staff at scale and the tools to introspect and analyze and deal with failures aren't robust enough yet.
I know they're just trying to contrast Riak and Cassandra with Couch and Mongo, and that Riak is designed to shard easily without the developer having to think about it.
That philosophy actually is "developer-oriented" in that it SEEMS like an operational savings because it was designed by developers.
Saying Riak is categorically non-operations-oriented is a bit hyperbolic, but I will be the first to acknowledge that we need even more visibility into failure-recovery / degraded-mode situations. I've spoken to a few customers who have "cheat sheets" of Erlang console commands they use to debug things like handoff slowness or poor performance in general. This alone means we need to do better.
On the other hand, Riak continues to function in scenarios where other databases would be completely unavailable. I'll take immature visibility during those situations over complete unavailability any day.
I appreciate your feedback - I can assure you that this is something we're constantly working on and you'll see improvements with each release.
Finally, if you've been bitten by anything specific you'd like to see fixed, we do all our development in the open at http://github.com/basho, so github issues, pull requests, etc go right into our internal tools and workflows.
Can you provide an example of where Riak was "nearly impossible to manage operationally without dedicated staff at scale and the tools to introspect and analyze and deal with failures aren't robust enough yet"?
Just watch the talk by Voxer on Riak (on Basho's site). It's basically an hour long explanation of why the things that come out of the box with Riak don't work for them. We ran into those problems as well.
Also, I'm giving a talk on it in 9 days at ErlangDC... Whisper is now a top-10 social networking app, and we had a number of critical Riak failures. I'll be elaborating on them, though the focus of the talk is not to bash Riak, no pun intended... just to provide our experience and how we worked around it.
That's an interesting take -- I think the Basho team positions Riak as an operations-oriented system (at least that's how I always thought about it, though I haven't used Riak in production).
Sorry, we added limitations a few hours after posting. I should have been more clear about the fact that they were edited in. I'll try to get better at this live blogging thing :)
Please keep talking about limitations in your marketing, even as the product gets more mature and has fewer and/or different ones. (There are _always_ limitations.) Learn from the MongoDB backlash.
Every NoSQL database is perfect and better than all the other options until you start using it in the real world. I'm not saying RethinkDB is not a good solution; the point is, NoSQL DBs are about compromises and specific problems.
I agree. I think NoSQL began exploding when people started doing more custom work and pushing away from big ORM-based frameworks, because NoSQL's relaxed schemas (or lack thereof) make things easier on developers. I still really like NoSQL for those reasons, but as everyone knows, in the real world there are a lot of gotchas. They seem to come from designers trying to solve one specific problem (hence the many specialized DBs), which makes sense considering the still-uncharted territory of NoSQL (proofs of concept). I'm excited to see how RethinkDB fits into the real world, though. It looks as though someone went back over years of research across multiple experiments, took what was right, and built it. But that said, just because one shirt is really nice and a pair of pants fits amazingly doesn't mean they match, and that's why I agree with you.
Thank you. I tell people I've stopped because I don't have the time. Really it's because I've since realized that the world is much more nuanced than I gave it credit for, but I have nowhere near sufficient writing skill to express it. I'll try to get back into it.
I wasn't criticizing RethinkDB, only the "marketing speak" or excessive excitement that normally comes with every new NoSQL solution.
Anyway, I wish I had the skills to create such a complex piece of software, congrats =)
Looks very interesting, but this statement in their FAQ is a red flag for me:
> How can I understand the performance of slow queries?
> Understanding query performance currently requires a pretty deep understanding of the system. For the moment, the easiest way to get an idea of why your query isn't performing well is to ask us.
Wish RethinkDB was a little further along because it seems like it might be a good fit for a new service I'm building.
You mention comparing against Hadoop for computationally-intensive data analysis. Would Rethink be suitable for a several-terabyte dataset with non-computationally-intensive analytics?
Currently we're using Hive and Python over streaming Hadoop. There's no significant ongoing data accumulation; we're just analyzing the data we have.
We haven't tested on workloads like that, but I can't think of anything that would prevent this workload from working well. The idea behind Rethink's architecture is to eventually allow people to run full analytics on the same cluster as their live app. Currently we're optimizing for online-type queries, but you can run analytics queries too, it's just that we haven't given the optimizer enough love on that front yet.
How's the general performance and memory consumption on smaller machines, e.g. entry-level VPSes or the lower end of AWS VMs? I don't have any big projects in the pipeline that immediately require sharding etc., but I'd like to play with it on a few weekend-scale items.
This is a known limitation -- it will be resolved when we fix https://github.com/rethinkdb/rethinkdb/issues/207. We just picked the safest default (fsync per op) and shipped the product. #207 will make things a little smarter.
This is marketing cloaked in a developer portal. I think it's great that RethinkDB is trying to distinguish themselves from Mongo, but what's the real marginal utility of RethinkDB over Mongo?
Mongo has been around for years, and it still has problems.
RethinkDB is just launching a new product that essentially does the same thing as Mongo, but is maybe just a little easier to use.
I think the Yet Another Database (YAD) question still hasn't been answered by this post.
We have been looking at NewSQL (or even NoSQL) platforms for our databases at my place of work, and we also stumbled upon RethinkDB. While everything these guys say sounds amazing, we were looking for someone who has implemented it, or any third-party case study about it. Since we couldn't find any, we decided not to go with RethinkDB for now.
Does anyone here know any big website/service which uses RethinkDB?
«An asynchronous, event-driven architecture based on highly optimized coroutine code scales across multiple cores and processors, network cards, and storage systems.»
It may be a dumb question, but isn't this statement a bit contradictory? As far as I understand, event-driven design and coroutines (i.e. cooperative multitasking, lightweight threads, etc.) are the techniques usually chosen to AVOID concurrency.
How does such a design imply multicore scalability? Obviously, coroutines and event loops don't prevent you from running in multiple cores. I just fail to see the correlation.
This is a great question. We start a thread per core, and multiplex thousands of coroutines/events on each thread. When coroutines on different threads need to communicate, we send a message via a highly optimized message bus, so cross-thread communication code is localized. This means each thread is lock-free (i.e. when a coroutine needs to communicate with another coroutine, it sends a message and yields, so the CPU core can process other pending tasks). The code isn't wait-free -- a coroutine might have to wait, but it never ever locks the CPU core itself. So, as long as there is more work to do, the CPU will always be able to do it.
If instead we used threads + locking like traditional systems, we'd have to deal with "hot locks" that block out entire cores. Effectively we solved this problem once and for all, while systems that use threads + locks (like the Linux kernel) have to continuously solve it by making sure locks are extremely granular.
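A toy model of the shape being described (one event loop per "core", with cross-thread work delivered as messages rather than guarded by shared locks) might look like the sketch below. This is illustrative Python only: RethinkDB's real implementation is C++ coroutines on a message bus, and Python's GIL means this shows the structure, not the performance.

    # Toy model: one inbox-draining loop per "core"; cross-core work is a message,
    # never a shared lock. Purely illustrative, not RethinkDB code.
    import threading, queue

    class Loop(threading.Thread):
        def __init__(self, name, loops):
            super().__init__(daemon=True)
            self.name, self.loops, self.inbox = name, loops, queue.Queue()

        def send(self, target, msg):
            # Enqueue on the target loop's inbox and keep going; nothing blocks
            # on a lock shared between "cores".
            self.loops[target].inbox.put(msg)

        def run(self):
            while True:
                msg = self.inbox.get()       # next pending task for this core
                if msg == "stop":
                    return
                print(f"{self.name} handled {msg!r}")

    loops = {}
    loops.update({n: Loop(n, loops) for n in ("core-0", "core-1")})
    for loop in loops.values():
        loop.start()

    loops["core-0"].send("core-1", "read block 42")  # core-0 hands work to core-1
    for loop in loops.values():
        loop.inbox.put("stop")
        loop.join()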
We do effectively have an ad-hoc mini Erlang runtime that we wrote at the core of the system. I'm not sure how deliberate that was -- we sort of borrowed performance ideas from many places, tried a lot of different approaches, and settled on this one. Lots of this was definitely inspired by ideas from Erlang.
There definitely seems to be a version of Greenspun's Tenth Rule for Erlang. But I think Greenspunning has gotten too bad a name – sometimes implementing a subset of a classical system is exactly what you ought to do, for example when your problem allows you to exploit certain invariants that don't hold in the general case, or for some reason using the classical system itself (Erlang in this case) is not an option.
Right! Rethink has an ad-hoc Erlang runtime for message processing, and an ad-hoc Lisp for the query language. I'm both ashamed and proud of this at the same time :)
You get less overhead by making less frequent use of the low-level concurrency primitives that involve cross-core synchronization. Cross-core synchronization happens in rethinkdb mainly when you see an on_thread_t object constructed or destroyed (and also in a few other places), and those get batched when you have more than one per event loop (which is not necessarily good; inflated queue sizes are also something to be wary of). So if you want to attach a bunch of network cards and high-speed storage devices on opposite ends of a handful of CPUs, your throughput won't be hindered by the fact that millions of threads are trying to talk to one another.
Anyone know if there is a Lua driver for RethinkDB in the works? I suppose I could use the C client and think about generating one for Lua, but maybe someone has already done that?
There isn't one yet. We'll be publishing the new, much simpler spec for client driver writers. It'd be pretty easy to do a native Lua driver (based on protocol buffers).
Other than RethinkDB, BigCouch looks like both a developer- and operations-oriented database, since it is a Dynamo-like CouchDB. Does anyone have BigCouch experience?