Hacker News new | past | comments | ask | show | jobs | submit login
You Only Wish MongoDB Wasn't Relational (seanhess.github.com)
119 points by embwbam on Feb 1, 2012 | hide | past | favorite | 44 comments



"Denormalization might be important"

Denormalization is still important. Taking that comment example a bit further, if you need user records tied to the comments, you're going to want to include all the user information that you need for displaying the comment in the comment record itself. Mongodb doesn't have joins. So if you have 200 comments on a post by different people, you're going to have to grab those users from mongo in a separate call, and if you're sloppy about it, 200 separate calls.

If your users really need to change their user names, you might run a job that updates the comments async.


True and untrue.

I use MongoDB for my startup (yabblr.com) and have quite a relationship data stored in my entities.

For example, my user entites have an array of commentId's stored inside of them.

Whenever I fetch 1 or 1000 users from the database, a framework that i designed will look at the entire list of users and build a single list of commentIds based on that list.

Then I do one more lookup on the database to find all comments in my new list of comments.

Then a final pass over the users looks up the commentId associations and inserts the actual comment object into the user object.

Finally the list is passed back to the consumer. Its 2 database calls. It does require that there are 2 iterations over the first call (one to gather up comment Ids and one to associate comments with users) but since its being done in the application layer, its much easier to scale.


We were actually considering MongoDB for our system, given its scalable, high-performance, and open source qualities. However, since we are building a payments platform (specifically, a third party payments aggregator), we soon learned that our case is the exception where MongoDB--or any other pure NoSQL DB for that matter--is not an ideal solution (and is specifically noted as a less well suited use case on the MongoDB site), because NoSQL DBs are not transactional.

Any possible work-around proved to be complicated and tedious--likely requiring heavy application-side logic--if not inadequate. Going hybrid was an option (RDBMS for critical parts, and NoSQL for other non-critical parts), but that may not have been a prudent way to start building this thing, since it would be difficult to predict the portions that might be correlated or codependent in the future (where a JOIN between the RDBMS and the NoSQL DB would be inefficient, if not impossible).

As a payments platform, we cannot afford to lose any transactions, and every penny must be accounted for, so a transactional and ACID-compliant data storage is a must. Thus, we decided to go with a traditional RDBMS. With that said, and looking towards the future, I am considering the use of an ultra-high throughput RDBMS such as VoltDB for our system... A VoltDB vs. MongoDB (or any other NoSQL DB) would be interesting.


You do realize the entire banking system is not built with the idea of ACID/transactional databases in mind? You only really need durability guarantees to implement banking properly. Bank accounts are eventually consistent logs of immutable transaction data which are reconciled in batch.


You might not need ACID during the recording of each charging transaction; however, without ACID, you won't be able to correctly apply each transaction against your account balance during the reconciliation batch processing. Marking each charging record as charged and updating your account balance must be in one transaction, whether relying on the database transaction support or building your own transactional log.

So at the end of the day (yes end of the day processing), ACID is needed.


The account balance is calculated as a function of adding all the debits and credits together. Bank accounts are not stored as some kind of balance value that gets mutated over time. The balance is simply the result of some operation that is cached. It's quite simple. No ACID required. Most of these systems at large banks were built in IMS long before relational databases even existed.


>"You do realize the entire banking system is not built with the idea of ACID/transactional databases in mind?"

Well, I was referring to payments/transactions, not banking. As I specifically said, we are building a third party payments aggregator (TPPA). So, we handle transactions; we're not a bank.

In our case, there are many scenarios where ACID is needed. A couple of the most basic are the need to rollback a transaction (e.g. if it was cancelled, or if it did not succeed) and the need to recover from a failure (specifically, to get back to a consistent state to resume a failed transaction).


Since when does NoSQL mean no ACID? I thought any key-value store would be considered noSQL, and many of those are ACID.


That's a fair point. Some NoSQL products do provide ACID guarantees within the scope of a logical record. That's usually as far as they go, though. So for transactions that cross logical record boundaries and/or are commutative (i.e., the "sequence of things" matters), you would have to build the outer transaction scope in the app tier. It can be done, but it's the kind of heavy lifting that RDBMS products do pretty well. As always, right tool for the job.


One possible solution - put your userDB in a real DB (they are your crown jewels, and deserve a bit of special care), cache user data in a fast cache (local Mongo, Redis over a memory socket, Memcached, whatever - you do have a cache, right?), and let cache invalidation update comment names in due course.

I'm pretending there's two types of data - the "nexis" of the star (comments), which needs a fast DB, and the leaves (users), which are small enough to be cached.


This is exactly how we do it at my company.


Well, that makes me feel less crazy for doing it.


Was there a movement of "let's not design databases" and thus this article is needed to rebut it?

Design your database (or organically grow its needs), and you'll wind up breaking things apart appropriately.


Well, "schemaless" is one of the NoSQL buzzwords, sometimes glossed as "no data model required", which can verge on "don't design your database, just put stuff in it", though nobody actually advocates anything quite that strong.


Just a quick note: There is a feature planned called "virtual collections" for querying the embedded collection as if it were a standalone collection.

That way you could easily do db.posts.$comments.find({"name": username})

to retrieve all posts by a user, or

db.posts.$comments.find().sort({date:-1}).limit(10)

to retrieve the latest 10 comments on the blog.

Link to the feature in MongoDB's issue tracker: https://jira.mongodb.org/browse/SERVER-142


On a related note, Mike Dirolf, former MongoDB engineer, has a nice post on MongoDB data modeling on his new venture's (fiesta.cc) blog, which I found very helpful:

http://blog.fiesta.cc/post/11319522700/walkthrough-mongodb-d...


From the examples you mentioned, it seems sql is a better option for you.

sql databases scale pretty well too.


I understand that you can use MongoDB in a relational fashion, but I don't get how this is showing why. The mongo docs have examples showing what this blog says can't be done.

http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+F...

> t.find({}, {'x.y':1}) { "_id" : ObjectId("4c23f0486dad1c3 a68457d20"), "x" : { "y" : 1 } }

And the mongo docs even have an example pretty detailing what this page wants (paging of comments)

db.posts.find({}, {comments:{$slice: -5}}) // last 5 comments

So, what am I missing?


I should have addressed $slice. Yes, it lets you get a subset of comments, but it doesn't let you get the subset matched by your query. For example with this document:

{ "_id" : "one", "comments" : [ { "name" : "a", "asdf" : "1" }, { "name" : "b", "asdf" : "2" } ] }

The following query returns the comment "a", not "b"

> db.posts.find({"comments.name":"b"}, {comments: {$slice: 1}})

So, you're right, for paging, $slice works fine. For more advanced queries (ranges, the comments by user example), you're stuck. It sounds like 10gen is working to correct this though.


Yeah, I think the comments by user, I had done using Foursquare's Rogue, so thought it was part of mongo natively.

At least they have tickets open for it.


I build most of my apps this way. The main idea is that any time users can add content, it goes into its own collection. This way, records don't grow indefinitely (and hit the 4mb limit) but it also gives you a lot more querying power.

You sometimes want to just get the blog post object too, not all the comments. It's worth it to have to do two queries and get the extra selection power vs one round trip to the DB, but you have to potentially sift through a lot of data to get what you want (or get back a lot of data you don't need).

This is, in my opinion, the perfect mix of the relational/document schemes. You can nest things when they don't grow continuously, but you still get most of the querying power of a relational database.


> You sometimes want to just get the blog post object too, not all the comments.

I feel this should be clarified a bit (since you most likely have a more complicated scenario in mind) but it is entirely possible to only select certain fields to be returned from a query. So if a 'comments' document was inlined within a 'post' document, we could easily opt to only have the latter returned by mongod. Such a query might look like this:

    db.posts.findOne( {'_id' : <...>}, { comments:0} );
This would definitely cut down on the amount of data transferred over the wire, but I am not sure if there is any performance hit within mongod.


Just a quick correction - the default document limit has been 16mb for quite some time now.


I might be missing something from the post (which I skimmed), but when the author states that "there’s no way to get only the matched comment back" in the "When Nested Documents Become a Problem" section, isn't that exactly what the $elemMatch operator was designed for?

http://www.mongodb.org/display/DOCS/Advanced+Queries#Advance...

The syntax is a bit wonky, but I believe it does exactly what the author says is impossible.


It's less the match, and more what is returned by Mongo. If he matches comments.name, he wants the comments.text to go with it, not the entire document.

It becomes an issue when you have extremely large arrays in the document (e.g. thousands of comments).


Exactly, I just updated the post with examples of why slice and elemMatch don't work.


Or use postgres. Whichever.


Heresy! PostgreSQL is not WebScale - the bear told me so.


Making collections is cheap. Having one or two extra collections and joining data manually when needed isn't too bad. Then business requirements change and management makes you add new features, so you add more collections. One morning you show up to work and realize you're doing all of your joins and aggregates manually. You start devising a plan to switch back to a relational database...


When I see the first example -- comments as a collection of the post document -- the first thing I think of is that editing a comment requires one to retrieve the document, edit it, then save it back.

Is there some sort of repair process in mongo that prevents destructive updates in the case of 2 concurrent writes?


Mongo is single-master, so last-write-wins is how writes are resolved.

However, they do support partial updating. So you only need to write the specific part of the document that you are actually changing.


This is fine if your site is small and don't have a lot of users, but adding fields or data to existing documents is extremely expensive in mongoDB. Just don't design your product around this.


This seems like a relatively simple indexing problem to me. Why not just use another data structure to store the association for fast querying?


What would the advantages over the former approach?


Why not "events_users" collection for attendees map? What's the benefit of embedding attendees inside events collection?


Anybody?


what about cascading deletions?


Just run 2 removes. Delete the post, then delete all comments with that post id.


And hope that you don't have a crash, restart or connection failure in between the two operations...


>> This works nicely with the key/value idea that you should denormalize your data.

I wonder if the NoSQL crowd is affiliated with the REST crowd. :)


Does MongoDB have any advantages over using de-normalized tables within PostreSQL?


Mongo is specifically designed for horizontal scaling, so it makes it easy to establish replica sets and shards, and it includes an auto-balancer that will distribute your data between the shards. Pg (the database, not the guy!) can do much of this, but it's not quite as built-in.

Which isn't to say there aren't things that Pg is better for. Full text search comes to mind.


Or using the hstore? I would also like to hear an answer on this.


You usually don't have to think about schema. This is the exception, but there are quite a few things you CAN represent with a nested document.

You usually just make an object in whatever format your app wants, save it, and you can write query against it.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: