Our take on RethinkDB vs. MongoDB (rethinkdb.com)
116 points by coffeemug on Jan 31, 2013 | 83 comments



My company uses MongoDB. Our biggest pain points are:

1. MongoDB has massive storage overhead per field due to the BSON format. Even if you use single-character field names, you're still looking at space wasted on type bytes and null terminators, and fixed-width int32s bloat your storage further. We solve this by serializing our objects as binary blobs into the DB, and only using extra fields when we need an index. (See the byte-level sketch after this list.)

2. In Mongo, the entire DB eventually gets paged into memory, relying on the OS paging system, which murders performance once your data outgrows RAM. For a humongous DB, not so great.

3. #1 and #2 force #3, which is sharding. MongoDB requires deploying a "config cluster" - 3 additional instances to manage sharding (annoying that the nodes themselves cannot manage this, and expensive from an ops/cost standpoint).
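To put a number on pain point #1, here's a hand-rolled encoding of the single-field document {"a": 5}, following the BSON layout from bsonspec.org (a sketch for illustration only; field name and value are made up):

    import struct

    # BSON int32 element: type tag (0x10), cstring field name, little-endian value.
    name = b"a\x00"                                   # 1-char name + null terminator
    element = b"\x10" + name + struct.pack("<i", 5)   # 1 + 2 + 4 = 7 bytes

    # Document: int32 total length + elements + trailing null byte.
    doc = struct.pack("<i", 4 + len(element) + 1) + element + b"\x00"

    print(len(doc))  # 12 -- twelve bytes on disk to store four bytes of payload

Three of those twelve bytes (type tag, name terminator, document terminator) are pure framing, which is what the binary-blob workaround avoids.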

What I would like to know is:

1. What is the storage overhead per field of a document in RethinkDB? If it's greater than 1 byte, I'm wary.

2. Where is the .Net driver?


1. In the coming release we'll be storing documents on disk via protocol buffers, which, unlike BSON, have extremely low per-field overhead. A few releases after that we'll be able to do much better via compression of attribute name information (though this feature isn't specced yet).

2. No ETA yet, but we're about to publish an updated, better documented, better architected client-driver-to-server API spec, so we'll be seeing many more drivers soon.


If you use proto-bufs, it means you already have a system for internal auto-schematization. Why not pack all the fields together and use a bit-vector header to signify which fields are present and which fields have default values? I'd LOVE to see a document DB with ~1 bit overhead per field.
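A minimal sketch of that idea in Python (the schema and helper names are made up for illustration): with a fixed field order known to both writer and reader, one header byte covers eight fields, i.e. ~1 bit of overhead per field.

    import struct

    # Hypothetical fixed schema shared by writer and reader: (name, struct format).
    SCHEMA = [("id", "<q"), ("age", "<i"), ("score", "<d")]

    def pack_row(values):
        """Pack present fields behind a presence bitmask; absent fields cost nothing."""
        mask, payload = 0, b""
        for bit, (name, fmt) in enumerate(SCHEMA):
            if name in values:
                mask |= 1 << bit
                payload += struct.pack(fmt, values[name])
        return struct.pack("<B", mask) + payload

    def unpack_row(blob):
        (mask,) = struct.unpack_from("<B", blob)
        offset, out = 1, {}
        for bit, (name, fmt) in enumerate(SCHEMA):
            if mask & (1 << bit):
                (out[name],) = struct.unpack_from(fmt, blob, offset)
                offset += struct.calcsize(fmt)
        return out

    row = pack_row({"id": 7, "score": 0.5})
    assert unpack_row(row) == {"id": 7, "score": 0.5}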


Yes, that's pretty much what we're going to do. It's a bit hard to guarantee everything in a fully concurrent, sharded environment so it'll take a bit of time, but that's basically the plan.


#1 - you might find it's not just the per-field overhead, but the per-document one. Check out the powerOf2Sizes setting:

http://docs.mongodb.org/manual/reference/command/collMod/#us...

10gen have been thinking about compression but nothing specific has happened yet (https://jira.mongodb.org/browse/SERVER-164). ZFS + compression is interesting, but not 'production' quality if you're using Linux, and last time I tried to get MongoDB running on Solaris I gave up...


What really bugs me about Mongo:

https://jira.mongodb.org/browse/SERVER-863

The issue has been open for over two and a half years, is one of the most highly voted issues, and has yet to even reach active engineering status.

Agree with you that compression is just a workaround for the awful BSON format.



I meant for RethinkDB (BTW - I've been a contributor for the MongoDB driver).


I started using RethinkDB in one of my projects and am looking for excuses to use it in more of them. So far things have been great and honestly my impression is that RethinkDB doesn't get nearly the hype it deserves.

I used Mongo before and it is a fine db, and I don't think I would be sad to use it; however, Rethink really does so many things better.

Again, I just started using it and things are really good; I haven't run into any obvious limitations or annoyances.

There are several features that I really like. For example, the web admin is really well done, it is easy and obvious how you create a cluster, and there are a lot of small things that helped me jumpstart my development faster: I can run queries in the admin to try them out, and I also get data back to see what things will look like.

The only thing I am somewhat missing is 'brew install rethinkdb'


Actually, 'brew install rethinkdb' is now available. I'll update the install page with it in a moment.

EDIT: https://github.com/rethinkdb/rethinkdb/issues/269


Thank you :), again it is not really a serious gripe, as the regular installer does the job just fine, but I tend to install all apps like that and this helps.

+1 for sure for this one

btw, when I was installing it, I tried that even before checking the homepage :)


I would like to add (not to edit the original post one more time) that one of the things I like is the query language. I remember how weird it felt initially doing simple things in Mongo; Rethink's language is very simple, and queries make sense and are easy to read.


I know it's not formal for a comment like this, but: +1 for 'brew install rethinkdb'


Why brew instead of port? I've always wondered why the pendulum swings between ports and homebrew so often on HN ..


I don't think that pendulum has swung to the MacPorts side for many years.


It does indeed look very much like MongoDB, but made by people that actually know what they're doing. It's refreshing to see good database design for a change.


Can you expand a little bit on this? What design decisions in MongoDB vs RethinkDB are you referring to?


Rethink has durability, MVCC, joins, logical sharding, excellent admin tools, etc. All things that serious databases tend to have and Mongo doesn't.


MongoDB IS durable now by default, has a third-party MVCC implementation (MongoMVCC), and has pretty decent admin tools.

And this idea that joins are a requirement for a "serious" database makes absolutely no sense. Database-level joins are toxic for scalability and IMHO should always be done in the application layer.


Toxic seems like a strong word to describe standard relational database functionality. Are you seriously recommending that join functionality always be done in the application layer? If so, are you speaking from specific experience, and can you elaborate on your reasoning?

I've seen too many poor re-implementations of relational database functionality in the application layer to ever recommend it as a standard starting point. Doesn't the concept of not prematurely optimizing apply here? Solve the scalability problem when you need to. That may mean moving some join functionality into the application layer, but the solution to any given scalability problem depends on the specifics of the problem. Just throwing out database joins as a rule seems drastic.
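For readers weighing the two positions, this is roughly what "joins in the application layer" means in practice: two round trips plus hand stitching (a hypothetical pymongo schema, purely illustrative):

    from pymongo import MongoClient

    db = MongoClient().blog  # hypothetical database with posts/authors collections

    # Round trip 1: fetch the posts (projection keeps only what we need).
    posts = list(db.posts.find({}, {"title": 1, "author_id": 1}))

    # Round trip 2: fetch every referenced author in one query.
    author_ids = [p["author_id"] for p in posts]
    authors = {a["_id"]: a for a in db.authors.find({"_id": {"$in": author_ids}})}

    # The "join", re-implemented by hand in the application.
    for p in posts:
        p["author"] = authors[p["author_id"]]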


Mongo in its most durable mode (which btw isn't what, say, Postgres would call durable) is really slow. Why even bother with it anymore?

First party MVCC is the only one that matters. It affects vital things like backups, analytical queries and transactions.

Joins are extremely useful. If a database does the sharding, it is almost always better for it to do the joins as well. Performance can be good with the right model, and Mongo is slow anyway.


So we are in agreement then. MongoDB IS durable, but it will be slower doing so. Hardly a surprise there. And I still have to disagree about the joins, but hey, agree to disagree.

As for MongoDB performance, well, making a blanket statement is pretty silly. On a previous project I had queries that were upwards of 40x faster in MongoDB than MySQL. Why? Because MongoDB allows the ability to embed documents within other documents, to the point where I could make a single query with zero joins to fetch 20 entities' worth of data.
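For illustration, the kind of document shape being described - related entities embedded so one lookup replaces a multi-table join (hypothetical pymongo names):

    from pymongo import MongoClient

    posts = MongoClient().blog.posts  # hypothetical collection

    # Author and comments live inside the post document itself...
    posts.insert_one({
        "_id": "post-1",
        "title": "Hello",
        "author": {"name": "alice", "karma": 1200},
        "comments": [{"by": "bob", "text": "Nice!"},
                     {"by": "carol", "text": "+1"}],
    })

    # ...so one read returns what a normalized schema would join together.
    post = posts.find_one({"_id": "post-1"})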

Every database is optimal for different use cases.


I wonder what sort of performance you would have gotten using MySQL or PostgreSQL, but denormalizing your data into JSON.


MongoDB has been durable for a while with journalling. They've only just enabled safe mode (i.e. synchronous writes) for the clients by default, but that is something different from being durable.

If you want a durable write, you should not disable journalling, and use safe mode / getLastError with the desired writeConcern setting - http://docs.mongodb.org/manual/reference/command/getLastErro...
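With today's Python driver that looks roughly like the sketch below; in the era of this thread the same thing was spelled safe=True plus getLastError options, so treat the exact API as illustrative:

    from pymongo import MongoClient, WriteConcern

    coll = MongoClient().mydb.events  # hypothetical collection

    # w=1 waits for the primary's acknowledgement; j=True additionally waits
    # for the write to hit the on-disk journal before returning.
    durable = coll.with_options(write_concern=WriteConcern(w=1, j=True))
    durable.insert_one({"msg": "must survive a crash"})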


> If you want a durable write, you should not disable journalling, and use safe mode / getLastError with the desired writeConcern setting

Sure. Which is the default approach of almost all of the drivers.


Not quite true; historically most (i.e. the official ones) defaulted to safe=false.

Also safe=true only makes sure the server acknowledged your write; writeconcern allows you to wait for it to be written to the journal or more.

Also, journalling is not controllable via client drivers, only via startup flags / config options.


> Database-level joins are toxic for scalability and IMHO should always be done in the application layer.

Not having database-level joins is toxic for scalability for so many reasons.

MongoDB reminds me of MySQL: The Early Years. When every ignorant design decision and missing piece of functionality was somehow actually a benefit. Then it gained them, and most people nervously smiled and moved on.


> Database-level joins are toxic for scalability

Tell that to Teradata.


Most of the posts I've seen about RethinkDB focus on "hey, we're a better NoSQL solution than MongoDB." That could be true, but so far I see it mostly coming from RethinkDB themselves, or people who like the design in theory.

However, does anyone have any practical real-world experience using it? It's not production ready (from what I gather), but has anybody actually used it for real world stuff?

For my own part, I tried it out and got stuck trying to implement a many-to-many style join. I did some searching, and it looks like that is not really possible at this point. Not a big deal, but it might be handy to have some example SQL-to-RethinkDB queries, just to help us newbies figure out the ropes.


> However, does anyone have any practical real-world experience using it? It's not production ready (from what I gather), but has anybody actually used it for real world stuff?

There are people who told us they are experimenting with it for real apps; they've sent extensive feedback (most of which can be seen on GitHub), and some have started to build libraries for RethinkDB. As with anything young and open source, though, it's difficult to tell with certainty how many projects are using a tool and what stage they're at.

> it might be handy to have some example SQL-to-RethinkDB queries, just to help us newbies figure out the ropes.

Working on it already.

alex @ rethinkdb


Hi, slava @ rethink here. People have been using Rethink for lots of projects, but there are still some showstoppers we have to work out (better docs notwithstanding) -- remember, the product has been in the wild for less than 90 days.

What did you try to do with a many-to-many join? We could help you with writing the query, and could add syntax sugar to the language to make it easier if it makes sense.


> "remember, the product has been in the wild for less than 90 days."

I hear you, and, FWIW, I'm excited about Rethink. To rephrase my question/observation: your article clearly lays out why you think it is better than MongoDB, using some quotes from people who agree with you. However, without some real-world data, it is still an argument rooted in theory. I like theory, but I also like to take real-world data to my bosses. Do you have any stats/examples that actually compare and demonstrate the performance? (I understand that wasn't the purpose of your article; just asking as a follow-up.)

Regarding the many-to-many joins: I was just playing around with a contrived example: "a blog post has and belongs to many categories." Mainly I was just curious how to do it; I didn't _need_ it for anything. But I couldn't figure out how to write it with the query language. I was using the Python DSL, FWIW.


I also hear you - there will be real data soon. We just want to be careful about making sure everything is sound before publishing information.

As for many-to-many joins, I'll write something up about it, thanks!!


A many-to-many join should be possible in RethinkDB using `innerJoin`. What did your data schema look like?
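For the "post has and belongs to many categories" example upthread, a sketch with the Python driver, assuming a junction table (table and field names are made up):

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    # Junction-table schema: post_categories holds {post_id, category_id} docs.
    # Categories for one post: filter the junction table, join to categories.
    categories = (
        r.table("post_categories")
         .filter({"post_id": "some-post-id"})
         .inner_join(r.table("categories"),
                     lambda link, cat: link["category_id"] == cat["id"])
         .zip()                      # merge each joined pair into one document
         .run(conn))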


Riak is NOT operations-oriented. It's nearly impossible to manage operationally without dedicated staff at scale and the tools to introspect and analyze and deal with failures aren't robust enough yet.

I know they're just trying to contrast Riak and Cassandra with Couch and Mongo, and that Riak is designed to shard easily without the developer having to think about it.

That philosophy is actually "developer-oriented" in that it only SEEMS like an operational savings - because it was designed by developers.


Chief Architect at Basho here:

Saying Riak is categorically non-operations-oriented is a bit hyperbolic, but I will be the first to acknowledge that we need even more visibility into failure-recovery / degraded-mode situations. I've spoken to a few customers who have "cheat sheets" of Erlang console commands they use to debug things like handoff slowness or poor performance in general. This alone means we need to do better.

On the other hand, Riak continues to function in scenarios where other databases would be completely unavailable. I'll take immature visibility during those situations over complete unavailability any day.

I appreciate your feedback - I can assure you that this is something we're constantly working on and you'll see improvements with each release.

Finally, if you've been bitten by anything specific you'd like to see fixed, we do all our development in the open at http://github.com/basho, so GitHub issues, pull requests, etc. go right into our internal tools and workflows.

Cheers,

Andy Gross


Would love to give some feedback. I like Riak and I believe in it - over the long haul I'm 100% sure it will be an amazing product.

Let's grab beers at Erlang Factory!

Chad


Sounds good - I'll see you there!


Can you provide an example of where Riak was "nearly impossible to manage operationally without dedicated staff at scale and the tools to introspect and analyze and deal with failures aren't robust enough yet"?


Just watch the talk by Voxer on Riak (on Basho's site). It's basically an hour-long explanation of why the things that come out of the box with Riak don't work for them. We ran into those problems as well.

Also, I'm giving a talk on it in 9 days at ErlangDC. Whisper is now a top-10 social networking app and we had a number of critical Riak failures. I'll be elaborating on them, though the focus of the talk is not to bash Riak, no pun intended... just to provide our experience and how we worked around it.


That's an interesting take -- I think the Basho team positions Riak as an operations-oriented system (at least that's how I always thought about it, though I haven't used Riak in production).


Not mentioned: RethinkDB doesn't yet support secondary and compound indexes, which is a dealbreaker for a lot of setups.

Definitely looks interesting though, and I look forward to playing around with it at some point.


Read the article:

"Some key features like secondary indexes and live backup are still in development"


Sorry, we added limitations a few hours after posting. I should have been more clear about the fact that they were edited in. I'll try to get better at this live blogging thing :)


Please keep talking about limitations in your marketing, even as the product gets more mature and has fewer and/or different ones. (There are _always_ limitations.) Learn from the MongoDB backlash.


I'm eager to take RethinkDB for a spin, as soon as secondary and compound indexes are fully implemented.

[1] https://github.com/rethinkdb/rethinkdb/issues/88

[2] https://github.com/rethinkdb/rethinkdb/tree/jd_secondary_ind...


Every NoSQL database is perfect and better than all the other options until you start using it in the real world. I'm not saying RethinkDB is not a good solution; the point is, NoSQL dbs are about compromise and specific problems.


I agree. I think NoSQL began exploding when people started doing more custom work and pushing away from big ORM-based frameworks, because NoSQL's relaxed schemas (or lack thereof) make things easier on developers. I still really like NoSQL for those reasons, but as everyone knows, in the real world there are a lot of gotchas, which seem to come from the designers trying to solve one purpose (many specific DBs) - which makes sense considering the yet-uncharted territory of NoSQL (proofs of concept). I'm excited to see how RethinkDB fits into the real world, though. It looks as though someone went back over years of research across multiple experiments, took what was right, and built it. But that said, just because one shirt is really nice and a pair of pants fits amazingly doesn't mean they match, and that's why I agree with you.


Hi, Rethink is my baby. It is beautiful, but it is not perfect :)


Off topic: Slava, I wish you could resume writing for your blog - you've always had interesting things to say.


Thank you. I tell people I've stopped because I don't have the time. Really it's because I've since realized that the world is much more nuanced than I gave it credit for, and I have nowhere near the writing skill to express it. I'll try to get back into it.


I'll be looking forward to it! Thank you very much for the time you have already put into your writing, and for what's still to come.


I wasn't criticizing RethinkDB, only the "marketing speak" or excessive excitement that normally comes with every new NoSQL solution. Anyway, I wish I had the skills to create such a complex piece of software. Congrats =)


Looks very interesting, but this statement in their FAQ is a red flag for me:

> How can I understand the performance of slow queries?

> Understanding query performance currently requires a pretty deep understanding of the system. For the moment, the easiest way to get an idea of why your query isn't performing well is to ask us.

Wish RethinkDB was a little further along because it seems like it might be a good fit for a new service I'm building.


Michel @ RethinkDB

We are building a tool to explain in a nice way how a query is executed, what the bottlenecks are, etc. It should make it into 1.5.

You can track progress here https://github.com/rethinkdb/rethinkdb/issues/175 (it's kind of empty for now)


Interesting, I'll keep an eye on that. When do you think RethinkDB will be ready for production use?


We aim to be ready for production in 6 months.


This just reads like marketing speak. What are the disadvantages of RethinkDB?



You mention comparing against Hadoop for computationally-intensive data analysis. Would Rethink be suitable for a several-terabyte dataset with non-computationally-intensive analytics?

Currently we're using Hive and Python over streaming Hadoop. There's no significant ongoing data accumulation; we're just analyzing the data we have.


We haven't tested on workloads like that, but I can't think of anything that would prevent this workload from working well. The idea behind Rethink's architecture is to eventually allow people to run full analytics on the same cluster as their live app. Currently we're optimizing for online-type queries, but you can run analytics queries too, it's just that we haven't given the optimizer enough love on that front yet.


personal != marketing

P(biased|personal) -> 1


How's the general performance and memory consumption on smaller machines, e.g. entry-level VPSes or the lower spectrum of AWS VMs? I don't have any big projects in the pipeline that immediately require sharding etc., but I'd like to play with it on a few weekend-scale items.


We did some testing and it should be great, especially once https://github.com/rethinkdb/rethinkdb/issues/97 makes it in.


When I last tested, I maxed out at about 700 inserts/sec with nothing else happening (MBP Retina, SSD, etc.) - it's not bad, but not as fast as Mongo.

I'm going to benchmark it when I get some time!


This is a known limitation -- it will be resolved when we fix https://github.com/rethinkdb/rethinkdb/issues/207. We just picked the safest default (fsync per op) and shipped the product. #207 will make things a little smarter.


700 inserts / sec is still quite a lot for a single laptop and I'm sure it will get better over time.

:-)


This comparison speaks better than the post itself: http://www.rethinkdb.com/docs/comparisons/mongodb/


This is marketing cloaked in a developer portal. I think it's great that RethinkDB is trying to distinguish itself from Mongo, but what's the real marginal utility of RethinkDB over Mongo?

Mongo has been around for years, and it still has problems.

RethinkDB is just launching a new product that essentially does the same thing as Mongo, but is maybe just a little easier to use.

I think the Yet Another Database (YAD) question still hasn't been answered by this post.


We have been looking at NewSQL (or even NoSQL) platforms for our databases at my place of work, and we also stumbled upon RethinkDB. While everything these guys say does sound amazing, we were looking for someone who has implemented it, or any third-party case study about it. Since we couldn't find any, we decided not to go with RethinkDB for now.

Does anyone here know any big website/service which uses RethinkDB?


It's brand-new, and not recommended for production use yet, so I highly doubt that anything like that exists.


    «An asynchronous, event-driven architecture based on 
     highly optimized coroutine code scales across multiple
     cores and processors, network cards, and storage systems.»
It may be a dumb question, but isn't this statement a bit contradictory? As far as I understand, event-driven design and coroutines (i.e. cooperative multitasking, lightweight threads, etc.) are the techniques usually chosen to AVOID concurrency.

How does such a design imply multicore scalability? Obviously, coroutines and event loops don't prevent you from running in multiple cores. I just fail to see the correlation.


This is a great question. We start a thread per core, and multiplex thousands of coroutines/events on each thread. When coroutines on different threads need to communicate, we send a message via a highly optimized message bus, so cross-thread communication code is localized. This means each thread is lock-free (i.e. when a coroutine needs to communicate with another coroutine, it sends a message and yields, so the CPU core can process other pending tasks). The code isn't wait-free -- a coroutine might have to wait, but it never ever locks the CPU core itself. So, as long as there is more work to do, the CPU will always be able to do it.

If instead we used threads + locking like traditional systems, we'd have to deal with "hot locks" that block out entire cores. Effectively we solved this problem once and for all, while systems that use threads + locks (like the linux kernel) have to continuously solve it by making sure locks are extremely granular.
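A toy sketch of that shape in Python (all names hypothetical; note that Python's queue.Queue hides a lock internally, where RethinkDB uses a highly optimized lock-free message bus, so this only illustrates the structure):

    import threading, queue

    class Shard(threading.Thread):
        """One event loop pinned (conceptually) to one core."""
        def __init__(self, name):
            super().__init__(daemon=True)
            self.name = name
            self.inbox = queue.Queue()      # the only cross-thread touch point

        def send(self, msg):
            self.inbox.put(msg)             # enqueue and return; sender never blocks

        def run(self):
            while True:                     # the event loop: drain messages forever
                msg = self.inbox.get()
                if msg is None:
                    return
                print(f"{self.name} handling {msg!r}")

    shards = [Shard(f"core-{i}") for i in range(4)]
    for s in shards:
        s.start()

    # A coroutine on core-0 needing data owned by core-2 just posts a message
    # and yields; core-0 keeps processing other work instead of spinning on a lock.
    shards[2].send(("read", "key-42"))

    for s in shards:
        s.send(None)                        # shutdown sentinel
    for s in shards:
        s.join()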


Sounds very Erlang-ish. Did you copy that deliberately?


We do effectively have an ad-hoc mini Erlang runtime that we wrote at the core of the system. I'm not sure how deliberate that was -- we sort of borrowed performance ideas from many places, tried a lot of different approaches, and settled on this one. Lots of this was definitely inspired by ideas from Erlang.


There definitely seems to be a version of Greenspun's Tenth Rule for Erlang. But I think Greenspunning has gotten too bad a name – sometimes implementing a subset of a classical system is exactly what you ought to do, for example when your problem allows you to exploit certain invariants that don't hold in the general case, or for some reason using the classical system itself (Erlang in this case) is not an option.


Right! Rethink has an ad-hoc Erlang runtime for message processing, and an ad-hoc Lisp for the query language. I'm both ashamed and proud of this at the same time :)


You get less overhead by invoking the low-level concurrency primitives that involve cross-core synchronization less often. Cross-core synchronization happens in rethinkdb mainly when you see an on_thread_t object constructed or destroyed (and in a few other places), and those get batched when you have more than one per event loop (which is not necessarily good; inflated queue sizes are also something to be wary of). So if you want to attach a bunch of network cards and high-speed storage devices on opposite ends of a handful of CPUs, your throughput won't be hindered by the fact that millions of threads are trying to talk to one another.


Anyone know if there is a Lua driver for RethinkDB in the works? I suppose I could use the C client and think about generating one for Lua, but maybe someone has already done that?


There isn't one yet. We'll be publishing the new, much simpler spec for client driver writers. It'd be pretty easy to do a native Lua driver (based on protocol buffers).


Other than RethinkDB, BigCouch looks like both a developer- and operations-oriented database, since it is a Dynamo-like CouchDB. Does anyone have any BigCouch experience?


Where is the Windows port? And if there is one, will it always be second-rate compared to the Unix and Linux ports?

Also where is the .Net driver?


Geo support? I think no!



