Although I'm very interested in the subject matter (I have a production site that uses MongoDB and AWS), I often find presentations like this incredibly annoying to read.
The problem is that presentations are short-form jot notes that are meant to be supported by a delivery as well. The vast majority of the value comes from what's not written down.
What's worse is that the "signal to noise" ratio detracts from my ability to understand what you're really going on about.
Some examples:
You say "MongoStat is Amazing" why? how? what does it do? (I've personally played with and used it a little, but I'd like to know why you're amazed by it. How can the things you've learned improve my practice?
You mention starting with 1.2 - does that mean you're not using 1.8? I've used 1.2 and was suffering 80% - 99.9% CPU load; as soon as I updated to 1.8.1 the load dropped to closer to 0.03% for most of the same operations. Does that mean your talk is invalid? Obviously there's a lot more to your story than I can get from your slides.
All things considered, though, I'm not trying to come off as a jackass. Normally when I see a HackerNews->slidedeck I just get frustrated and annoyed by things like the "inside joke" and "cute kitten slides" that rely on the speaker to deliver some insightful punchline. In this case I did read one or two interesting points.
Though I find I'm still suffering from the signal-to-noise problem that accompanies all these decks, this was better than most. I'd love a bit more narrative around the points that matter and a bit less around things that don't (how to query in mongo isn't really necessary in a story of why you abandoned it, is it?)
Anyways - was that a rant, review or stream of consciousness?
TL;DR Thanks for the slide deck; I hate slide decks; yours was less terrible than most.
Audio should be coming, but I'm not sure it will help. I definitely tried to pack way too much into a 45 minute talk.
Not sure how to explain mongostat further... have you run it on a loaded server? Just a great little near-real-time stats tool. Like vmstat for a db. It could probably have a talk all its own just to explain how to diagnose performance issues with it.
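For anyone who hasn't run it: a rough idea of what mongostat is sampling, as a pymongo sketch that polls serverStatus once a second (field names like "opcounters" and "extra_info.page_faults" are what I'd expect on Linux for this era of MongoDB; verify against your version):

    import time
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    prev = None
    while True:
        status = client.admin.command("serverStatus")
        ops = status.get("opcounters", {})          # insert/query/update/delete counters
        faults = status.get("extra_info", {}).get("page_faults")  # Linux only
        if prev is not None:
            # Per-second deltas, roughly what mongostat prints per row.
            print({k: ops.get(k, 0) - prev.get(k, 0) for k in ops}, "faults:", faults)
        prev = ops
        time.sleep(1)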
Most of my comments apply to MongoDB 1.4 although our little bit of data that remains in MongoDB has been upgraded to 1.8 (via 1.6 first). I've never had MongoDB use a significant amount of CPU load (from 1.2 - 1.8).
The presentation claims that MongoDB performance drops off a cliff once the data or the indices no longer fit in RAM. This is not entirely complete/correct.
First off, the same problem occurs no matter which database. Oracle, PostgreSQL, MySQL, whatever, they all grind to a halt when stuff no longer fits in RAM.
Second, it's not really the whole data set or the whole index that needs to fit in RAM, it's the working set. This is a pretty big difference. I'm currently running a production MongoDB database which is about 20 times larger than RAM, and at first I made the mistake of using random string primary keys (_ids). This ensured good distribution between shards but also ensured good distribution within the primary key index, which essentially made the working set of the primary key index equal to the entire index. Once the primary key index no longer fit in RAM, the database ground to a halt. I now use a primary key consisting of a timestamp followed by some random text. The timestamp ensures that inserts only hit the end of the index, thereby keeping the working set small. The database is now almost as fast as when it was fresh.
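A rough sketch of that kind of key (illustrative only, not the exact production format described above): the leading timestamp keeps new inserts clustered at the right-hand edge of the _id index, so only recent sub-trees stay hot.

    import os, time, binascii

    def make_id():
        # Zero-padded timestamp first, so new keys sort after all existing ones,
        # followed by a short random suffix to avoid collisions.
        ts = "%012x" % int(time.time())
        suffix = binascii.hexlify(os.urandom(6)).decode()
        return ts + suffix

    doc = {"_id": make_id(), "payload": "..."}
    # collection.insert(doc)  # inserts now touch only the right-hand side of the index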
This harsh cliff does NOT occur with every database. The old guard RDBMSs have extremely sophisticated and well-tuned I/O code that has been working around slow disks for decades. While you can't escape the physics of rotational drives, MongoDB does probably the worst job possible of dealing with their poor performance. Yeah, great, it's fast as long as data is in memory; welcome to the world of b-trees. With a high disk:RAM ratio for a data set (like >30:1), it's slower than PostgreSQL. Without compaction, its performance degrades to become many orders of magnitude slower than PostgreSQL.
The other major problem is that there is no tooling that can be used to measure how large your working set is in MongoDB, so by the time you figure out you're near the cliff, it's WAY too late.
MongoDB encourages poor decisions in query design that lead to these problems. The "rich query model" is enough rope to hang yourself with in a big data / high scale situation. What makes MongoDB popular is that it's got much of the convenience of SQL queries, but the problem is that it also falls into the same set of traps. This is exactly what happened to Foursquare -- MongoDB's marketing materials won't help escape the realities of data locality, the scale-up hit, poor concurrency design, and rotational disk speed coupled with mediocre caching infrastructure and an unpolished I/O subsystem.
As far as I'm concerned, MongoDB isn't for big data or scale -- yet. Its best use is for small to medium-sized data sets which benefit from its schema-free design.
While it's true that some other DBs have better code for working around slow disks, does it really help all that much? In my case the random primary key was causing every single operation to result in a seek to a random location on disk. I really doubt other DBs could have helped me no matter how good their code is. At best they postpone the core problem.
What tools do you use to measure the working set in PostgreSQL and other databases?
A random operation results in random I/O, another unavoidable problem. What I speak of is the way in which modern, MVCC-based databases deal with concurrency and optimally pattern writes in a way that makes reads lock-free. In addition, they don't sacrifice any level of durability or ability to quickly recover a corrupted database to get this. It's just a more advanced design.
Usually each table is kept in its own file (or set of files, if needed), which helps sequential I/O for a single table stay sequential, and also provides a point of logical partitioning for database operations.
In contrast, MongoDB stores ALL of the data for a database in a single set of sequentially numbered files. This works great if you have a single collection that can be ordered monotonically and queried in a roughly time-ordered way, and you never delete anything, but it's pretty bad otherwise. MongoDB was originally built for analytics, and it shows.
In a modern RDBMS, everything gets written first to a write-ahead log (WAL), in which the database will group several transactions together to make this a mostly sequential I/O operation. Many DBAs insist on a dedicated disk for their WALs so that the drive head is dedicated to this sequential I/O. There are also often multiple WALs, allowing multiple threads to write to disk concurrently. Most MVCC databases default to a READ-COMMITTED isolation level (this actually relaxes ACID semantics somewhat), and at this level, reads and writes do not block each other.
When these WAL files grow to a certain point, the data in them is ordered and written to the table files as sequentially as possible in lower priority threads. Multiple updates are resolved and collapsed. Parts of the database files that contain deleted data are marked as such in some kind of heap structure that keeps track of this free space. Whatever index updates need to be made are done as well; doing them in bulk at checkpoint time is far preferable to doing them in-place at write time.
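A toy Python illustration of that log-then-checkpoint pattern (not any real database's code): writes go to a sequential log, and a later checkpoint collapses and sorts them before touching the table.

    log = []      # stand-in for the sequential WAL file
    table = {}    # stand-in for the randomly-accessed table file

    def write(key, value):
        # Sequential append; real systems group-fsync many of these at once.
        log.append((key, value))

    def checkpoint():
        # Collapse multiple updates to the same key, then apply in key order so
        # the table file is touched in as sequential a pattern as possible.
        latest = {}
        for key, value in log:
            latest[key] = value
        for key in sorted(latest):
            table[key] = latest[key]
        del log[:]

    write("b", 1); write("a", 2); write("a", 3)
    checkpoint()
    print(table)   # {'a': 3, 'b': 1}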
PostgreSQL specifically compresses large data values (ones that would push the row size beyond the 8K page size) and transparently keeps them in a separate TOAST table. The values are broken into 2,000 byte chunks. This prevents the main data files from getting too sparse when row sizes have a high level of variance.
Cassandra borrows the WAL-writes-first-then-lazily-checkpoint technique, but takes it a step further by using sorted, immutable table files. Cassandra doesn't have MVCC, so making the table files immutable means there's no need to worry about locking, as the contents never change. The sorted order means that merging table files together at compaction time can be done in roughly linear (O(N)) time.
All that work in making I/O as sequential as possible significantly softens the blow of running out of RAM cache for your most active set. Only random reads actually become slow in this case, instead of practically every kind of I/O like it does in MongoDB when used in real-world scenarios.
To answer your second question, as far as I know, what I've done to measure how far away PostgreSQL is from the RAM cliff is to set shared_buffers to 2/3 of RAM (or 3/4 of RAM if you're >8GB, 4/5 if you're >32GB, etc.), and measure hit rates between the buffers and the OS cache. Hit rates can be measured at the database level and also down to the table level. The idea here is that if your shared buffer hit rates begin to drop precipitously, you need more RAM, but it also gives you another third of the RAM on the box before you're truly doomed. PG works well with the OS cache, so performance should remain steady.
MongoDB can't give you this indicator because it's dependent on the OS cache entirely. I've spent a day looking for tools that would allow tracking how large the hot pool of Linux disk cache was, with no success.
PostgreSQL also gives you per-table statistics on how many queries are run, and then how much and what types of I/O are being performed, so issues can be caught and pinpointed before they become real problems.
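For reference, a sketch of that measurement with psycopg2, using the standard pg_stat_database and pg_statio_user_tables views (the connection string and the 0.99 threshold are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
    cur = conn.cursor()

    # Database-wide shared buffer hit rate.
    cur.execute("""
        SELECT blks_hit::float / NULLIF(blks_hit + blks_read, 0)
        FROM pg_stat_database WHERE datname = current_database()
    """)
    hit_rate = cur.fetchone()[0]

    # Per-table I/O, to pinpoint which tables are missing the cache.
    cur.execute("""
        SELECT relname, heap_blks_read, heap_blks_hit
        FROM pg_statio_user_tables ORDER BY heap_blks_read DESC LIMIT 10
    """)
    worst = cur.fetchall()

    if hit_rate is not None and hit_rate < 0.99:
        print("buffer hit rate dropping:", hit_rate, worst)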
Thank you for this informative reply; this kind of reply is what I'm looking for. There are a few things I should respond to:
> MongoDB stores ALL of the data for a database in a single set of sequentially numbered files.
Correct. But are you under the assumption that this means documents are spread randomly inside the files? MongoDB database files work with extents. Each collection consists of multiple extents. The extents may be scattered across the database files, but each extent is contiguous.
> write-ahead log (WAL), in which the database will group several transactions together to make this a mostly sequential I/O operation
Very informative, thanks. I thought the WAL was only a durability feature. Indeed, all the resources that I've read on WAL so far only mention it in the context of durability. I had to read your post a few times to understand that it can be used for optimizing I/O.
That said, MongoDB relies on the operating system's I/O scheduler to perform the reordering. It just performs the write without fsyncing, allowing the writes to end up in the OS page cache. The OS is then free to reorder the writes however it likes.
How useful is the WAL for reordering I/O compared to what the OS already does? Do you think one is superior to the other? Other than bulk index updates of course.
> Only random reads actually become slow in this case, instead of practically every kind of I/O like it does in MongoDB when used in real-world scenarios.
Actually, after fixing the _id problem my MongoDB database became insanely fast again. I/O reordering does happen with MongoDB, just not in MongoDB itself.
> MongoDB can't give you this indicator because it's dependent on the OS cache entirely.
Your technique for measuring the working set sounds not too far off from what I do with MongoDB although I use a few more tools. The 'mongostat' utility shows how many page faults MongoDB gets; this is similar to your buffer miss rates. You can use Linux cgroups to limit MongoDB's page cache size. I also use mincore() and fincore() to see which places in the database files get paged in.
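For anyone wanting to try the mincore() approach, here's a rough Linux-only ctypes sketch (the data file path is hypothetical); it reports what fraction of a file's pages are currently resident in the OS page cache:

    import ctypes, ctypes.util, os

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mmap.restype = ctypes.c_void_p
    libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                          ctypes.c_int, ctypes.c_int, ctypes.c_long]

    PROT_READ, MAP_SHARED = 0x1, 0x01      # Linux values
    PAGE = os.sysconf("SC_PAGE_SIZE")

    def resident_ratio(path):
        size = os.path.getsize(path)
        if size == 0:
            return 0.0
        fd = os.open(path, os.O_RDONLY)
        try:
            addr = libc.mmap(None, size, PROT_READ, MAP_SHARED, fd, 0)
            if addr == ctypes.c_void_p(-1).value:
                raise OSError(ctypes.get_errno(), "mmap failed")
            pages = (size + PAGE - 1) // PAGE
            vec = (ctypes.c_ubyte * pages)()
            if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(size), vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            libc.munmap(ctypes.c_void_p(addr), ctypes.c_size_t(size))
            # Low bit of each byte = that page is resident in the page cache.
            return sum(b & 1 for b in vec) / float(pages)
        finally:
            os.close(fd)

    print(resident_ratio("/var/lib/mongodb/mydb.0"))   # hypothetical data file path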
All in all, I have to change my opinion and admit that PostgreSQL does do many things better than MongoDB. However, I think MongoDB doesn't do as badly as you think it does, and the automatic sharding feature in MongoDB is still something I really need.
You are correct that it's broken into extents, but as the extents become sparse over time without compaction, I/O gets scattered further and further around the disk.
The WAL is as much a durability feature as it is about fast recovery, fast writes, and concurrency. MongoDB uses a journal collection for durability, which is quite a different thing, and actually requires doubled writes. The journal collection is just another MongoDB collection, instead of being an optimized log file structure.
The OS I/O scheduler works fairly well, but it's a general purpose tool and must work well for all types of I/O. The WAL writes themselves are ALWAYS sequential, something in-place updates will never get. The OS still tries to flush writes within something like 5 seconds (the Linux deadline scheduler). The RDBMS WAL checkpoint process can gather hundreds of megabytes of data over many minutes or hours, sort it, and write to the table files in a single pass.
In PostgreSQL, you can tune the checkpoint process down to how much data to write before a checkpoint, how long to wait at maximum, and how much I/O capacity to consume while checkpointing.
The cgroups, mincore, and fincore techniques are very interesting; I'm going to look into them.
The key to the PostgreSQL strategy is segmenting the cache into in-process buffers and the OS disk cache. It acts as an early warning system once the hit rate starts to cross that threshold.
I'd be interested in learning about any strategies that could be used to segment the OS disk cache into multiple pools, because I'd prefer to actually reduce the shared buffer pool as it's somewhat less efficient than just using the OS disk cache (duplicated caches). Perhaps some kind of automated process could monitor the hit rate and actively expand the cgroups as needed to try to meet a hit rate target, and give you some warning when a RAM % threshold was exceeded.
There is nothing evil or wrong about MongoDB itself, it's just a new product that needs more polish. Many years ago when I first started using PostgreSQL, it needed much polish. It takes years and years to fine tune these types of things and find out how the initial assumptions made work out in real life.
In my experience the old guard RDBMSs have a lot of performance cliffs that they love to fall off of with no warning when you try to scale them.
Ever seen an Oracle database encounter too much contention on a lock? When the lock is operating at 98% capacity there is no trace of a problem. At 101% of capacity you fall over. And there are no tools to catch this.
You're absolutely right about the working set being the important bit to keep in memory (and all the indexes you use). I clarified that in person thanks to a great question from the audience.
My entire point wrt it slowing-down-outside-of-RAM is that it happens with any database. My point was that any database can be fast if it's in RAM and non-durable; yet Mongo trumpets performance under these unsafe scenarios while other databases use safe defaults.
To make matters worse, MongoDB data can suffer from poor data locality over time. Since caching is completely left up to the kernel, 1 hot document, even if it's only 200 bytes, will cause an entire page to remain in memory. The other bytes in the page may be cold documents or null data.
This is why flip/flopping is useful in current versions and in-place compaction so important for future versions. For prior research see defrag.exe.
I wish you luck in your attempts to tune performance by tweaking _id. I could only take doing hacks like that for so long before I looked for greener pastures.
Why do you consider tweaking _id a hack, and what is the greener pasture for you? If you use random string primary keys in any other database you'd run into the same problem. My app is much more write-intensive than read-intensive. My stuff wouldn't work on PostgreSQL at all, I have so much data that I must use sharding.
You're basically implementing an optimized version of ObjectId for your use case and having to take a couple stabs at it as well. No big deal, but there are just lots of little tricks like that to learn with MongoDB.
"My stuff wouldn't work on PostgreSQL at all, I have so much data that I must use sharding."
We had to implement sharding as well and chose to do it manually in PostgreSQL. Luckily our schema made that relatively easy and natural. YMMV
I have a 1.5 TB Postgres database, sharded by schema, that runs wonderfully on a single box (12 core, 36GB RAM, raid 10 of 15k SAS drives). Why couldn't you shard with Postgres?
Good point; it does sound from my comment like I intend to keep it that way for this particular database.
It's set up for multiple boxes (each schema is mapped to a hostname in code), but I simply haven't had a reason to move to more boxes yet. The schema feature is a nice, convenient way to pre-shard like this so that growing to more boxes doesn't require rehashing for a very long time, if ever (depending on how much sharding you do up front). You just move schemas/shards as needed using the standard dump and restore tools and update the schema->hostname mapping in the code.
If I need to go cross-shard then I am doing it in code. If you knew both shards were on the same box you could do cross-shard joins if you used schemas like this but you would need some potentially tricky logic that determines if it is working with all shards on the same machine.
Thankfully most of the joins happen within a shard (hashing and sharding on something like a user_id) with the exception being various analysis and aggregation queries.
Using PostgreSQL's schemas is admittedly not too different from just using many DBs in MySQL or something else but in practice I've found that extra layer of organization helps keep things neater. I can backup, move, or delete a specific schema/shard or I can backup, move, etc all shards on a machine by operating on the containing database.
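To make the schema-per-shard idea concrete, a rough sketch (the shard count, naming, and schema-to-host map are invented; psycopg2 assumed):

    import hashlib
    import psycopg2

    SHARD_COUNT = 16
    # All shards live on one box today; moving one is a dump/restore plus a map update.
    SCHEMA_TO_HOST = {"shard_%02d" % i: "pg-box-1" for i in range(SHARD_COUNT)}

    def cursor_for(user_id):
        # Stable hash so the same user always maps to the same schema/shard.
        n = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % SHARD_COUNT
        schema = "shard_%02d" % n
        conn = psycopg2.connect(host=SCHEMA_TO_HOST[schema], dbname="app")
        cur = conn.cursor()
        cur.execute("SET search_path TO %s" % schema)  # schema names are our own, not user input
        return cur

    cur = cursor_for(12345)
    cur.execute("SELECT count(*) FROM events")         # resolves to shard_NN.events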
FYI - Your _id trick is similar to the ObjectID type mongodb uses by default.
"A BSON ObjectID is a 12-byte value consisting of a 4-byte timestamp (seconds since epoch), a 3-byte machine id, a 2-byte process id, and a 3-byte counter. Note that the timestamp and counter fields must be stored big endian unlike the rest of BSON. This is because they are compared byte-by-byte and we want to ensure a mostly increasing order."
My _id is different from ObjectID. It does begin with a timestamp but one that has 2010 as epoch. It's also followed by a kind of user ID and a piece of random identifier that appears elsewhere in the document.
That totally depends on your workload. In my case my working set happens to be mostly equal to the most recently inserted data. If you have to regularly access lots of random documents with no locality whatsoever then your working set is very large and should fit in memory.
Even sticking with the default _id, my _id index wouldn't have to fit into memory for my workload. I can't actually use the default _id for various reasons, but that's a whole different discussion.
Forgive the dumb question, but what do you mean by a working set? Is this the part of the index that covers the shard that server is on? Or is your domain such that you only care about the most recent records? For my part, it's rare that a table has records I will never want to see, and so could live in the non-RAM part of the index.
I think a reread of the OP suggests that only a subset of data and indexes are in RAM, but my question still stands, slightly rephrased: are you saying the best practice is to shard your database and place the shards in RAM on as many servers as needed?
The best practice totally depends on your workload and what kind of trade-offs you're willing to make. I believe this is exactly the reason why there don't seem to be any good "best practice" documents for databases: it's different for everybody. In my case, there are a lot of writes and only occasional reads. My database is insert-mostly; once inserted, the data rarely changes. All data is timestamped. The reads usually pertain to the most recently inserted data. In my case it makes sense to optimize the database in such a way that the most recently inserted data is in my working set.
This does mean that querying old data is slower than querying recent data. It could be made faster by ensuring that all data fits in RAM, e.g. by adding more shards. But given the huge amount of data I have, placing all shards in RAM is way too expensive, and I'm not confident that my users are willing to pay 20 times the fees just to have faster queries for old data. The alternative is to store things on SSDs, which should also make things faster, but that would still result in a 3-4 times fee increase for users.
Well... no shit, people. For whatever reason, many developers think MongoDB is an SQL replacement.
To be fair, MongoDB did a very good job of hiding information which made it look bad, while posting misleading benchmarks. I think the majority of the user base doesn't know MongoDB isn't durable out of the box... they think it's just a "faster", better MySQL.
Doing solid research goes a long way to tone down the hype.
For the record: we (MongoDB) have a policy of not posting official Benchmarks. I'm not sure what benchmarks you are referring to that were posted and misleading.
The only benchmarks we have ever done have been for internal testing comparisons between different versions.
Previously there were several third party benchmarks on this page, but if you examine the page history we have always identified them as unaffiliated, third-party sourced.
I actually tend to believe many developers realize that data in a relational DB is not the only way to go about things. Having spent my entire adult life in a SQL bubble, it's refreshing to have all these alternative tools now that can do more for me as a collective whole. But of course blindly doing anything without a reason is a bad idea.
For me, Mongo blows any RDBMS out of the water with regards to prototyping and getting an idea into a working solution. That alone is why I use it exclusively for new projects - with the understanding that if any of the projects ever become successful I'll have to split things out into their respective buckets. So far, I haven't been fortunate enough to have these problems.
Can you give a concrete example where MongoDB blew an RDBMS out of the water in regards to prototyping? I find that database schemas don't really get in the way until a project matures, and then it becomes a road block both in a legacy manner and that making changes puts strain on the database infrastructure.
I can't remember the last time a schema I made in week 1 was similar to the schema in week 4. That alone is why I made that statement. Modifying schemas requires modifying code all over the place that relies on it. The majority of the time, with Mongo, I can ignore those changes and only code for what I added.
That is my experience. Perhaps you have the patience and foresight to plan out your schema well before coding.
For the record something I did not mention in my slides or talk is that our data was not highly relational. For very similar data we've used non-SQL databases with success.
Definitely should be something I make clear in my talk, so people don't think we foolishly tried MongoDB for highly relational data.
Thanks for the clarification, my comment wasn't targeted at you actually (it seems like MongoDB could have been a good fit here), but at developers who don't even consider the kind of usage, but instead think of MongoDB as a "better MySQL", when they are in fact fundamentally different.
So thank you for posting this, it may make some people reconsider before they jump ship, or at least understand the trade-offs.
Cassandra, but mainly because we have a couple devs who are excellent at debugging Java/JVM issues. Cassandra is young enough you really need to expect to get your hands dirty in its code to keep things humming. (I may be more forgiving of Cassandra because, unlike Mongo, Cassandra's major version number is 0).
Another team seems awfully happy with Hadoop, although they've been bitten by an HDFS related bug. I can't speak too much to their experience.
Memcached of course is wonderful, and the little bit I've used Redis it has lived up to its promises.
Riak is the only non-relational database I can think of that I've only heard positive things about, but I've never personally used it.
As far as I'm aware this would be no different when reimporting a dump from MySQL/Postgres. If you're not restoring from a snapshot then the indexes have to rebuild themselves.
My experience has been that MySQL backup/restore ops (mysqldump; mysql < dump.sql) are orders of magnitude slower than MongoDB backup/restore ops. We use LVM snapshotting for MySQL backups because the "normal" method is just too slow.
I just did a test dump of a 1.7GB MongoDB database:
    real 0m5.781s
    user 0m0.570s
    sys  0m4.817s
I'm not in a position to do a restore just now, but my experience has been that restores are similarly quick. 120GB is obviously a far larger amount of data, but unless you've gone insanely nuts with the indexing, I can't imagine that it would take hours upon hours to do.
5 seconds for a 1.7 GB working set sounds really low: it would mean you could write > 200 MB/sec, assuming you could get data from MongoDB infinitely fast. Also, what happens to your Mongo database while you export data (can you still write to it?).
In any case, mysqldump is rarely the bottleneck - restoring is. For the latter, you really want to use the MySQL from Percona, because their version of mysqldump has an option to load the data without indexes first and create the indexes later (--innodb-optimize-keys=True). The effect will depend on the database schema of course, but I often see one order of magnitude difference for simple, large tables with a few simple indexes.
5 seconds does seem very fast, but this is on a high-end Rackspace server with what I'm sure are very fast hard drives and some 64GB of RAM. It's a bit slower on "normal" boxes, but still quite quick.
You can still write to your database, yes. There's a flag, --oplog, which when passed to mongodump, will record the oplog position at the time the backup starts, then dump any ops since the backup started as a part of the backup, so they get replayed on restore. This leaves your data consistent with the database state at the end of the backup without the need to issue a write lock.
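In script form, the flow described above looks roughly like this (paths are made up; --oplog and --oplogReplay are the flags in question):

    import subprocess

    # Dump while capturing the oplog, so writes made during the dump are included.
    subprocess.check_call(["mongodump", "--oplog", "--out", "/backups/nightly"])

    # Restore, then replay the captured oplog to reach a state consistent with
    # the moment the backup finished.
    subprocess.check_call(["mongorestore", "--oplogReplay", "/backups/nightly"])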
If you are in a position to do so, I can't recommend enough using the Percona XtraBackup[1] tool to enable online snapshotting of MySQL. (Also, if you are able to upgrade to XtraDB I would highly recommend looking into it).
A mysqldump / restore can be dependent on many variables when discussing the rate of import. Grouped indexes, delayed FK constraints, buffer sizes, and hardware specs to name a few. My point was more about how one is going to suffer this reindexing problem regardless of data storage facility :)
That may be true. Snapshotting is definitely superior for disaster recovery regardless of your database. I hope I stressed that enough in my talk as well as in my slides.
This reminds me of a video I saw about scaling with ruby on a niche social network (http://zenbach.com/scaling-fetlife-com-with-ruby). It was an interesting talk, but they also tried MongoDb and moved away from it, though, they were also using an older version.
* This was our experience, YMMV. I just wanted to share our story in hopes it would help others avoid some pitfalls.
I'll try to give honest responses to questions asked. I'm not trying to be unnecessarily cynical, overly negative, or anti-10gen in my slides, talk, or comments here. I'm sure many people are using MongoDB happily, and I've heard 10gen's support is outstanding.
A good part of your presentation should have been what your requirements are.
For example: MongoDB's schemaless design has a downside in that it means more data, since every key must be stored in each document individually. Great if you constantly need to fix/change the schema. A structured schema solution obviously saves that space.
Kind of a side topic, but lazy, JIT migrations work well for this. You keep a "migration chain" in your fetch layer that goes v1 -> v2, v2 -> v3, and every persisted doc is tagged with the schema version at the time of persistence.
Then, your binding code just runs it through whatever subset of the chain brings the document to v-latest. If the document is re-written, it will (naturally) be re-written in updated form, and you'll only pay the upgrade penalty once (or, not at all for cold data).
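A minimal sketch of such a chain (field and function names are invented for illustration):

    def v1_to_v2(doc):
        doc = dict(doc)
        doc["full_name"] = doc.pop("name", "")   # example rename
        doc["_schema_version"] = 2
        return doc

    def v2_to_v3(doc):
        doc = dict(doc)
        doc.setdefault("tags", [])               # example added field
        doc["_schema_version"] = 3
        return doc

    MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}
    LATEST = 3

    def load(doc):
        # Upgrade lazily at read time; re-saving persists the new shape.
        while doc.get("_schema_version", 1) < LATEST:
            doc = MIGRATIONS[doc.get("_schema_version", 1)](doc)
        return doc

    print(load({"_schema_version": 1, "name": "Ada"}))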
It sounds kinda creepy to old-school schema theorists, but in practice it actually works pretty well (provided you always use this library to retrieve your objects.)
Using this method, I've never done a stop-the-world schema migration on a schema-free system, and I don't think I ever would... otherwise, you're tossing out one of the major benefits of schema-free: no migrations! If you're going to be rigorous about full-collection schema consistency, you maybe should have just used a proper RDBMS to begin with...
I am trying to come up with a simple way of automatically handling data and schema migrations across a database that spans client and server treating both as masters. It just seems incredibly difficult to get right. Your idea of handling migration at read time sounds promising.
Yep--and just to reassure you, it's more than an idea on my end... I designed and maintained a critical system a few years ago on a large K/V store that used this method, and it worked without a hitch. We never even needed to think about the fact that the documents within the store were comprised of various schemas from several generations of the application (except when we wrote and tested the next incremental link in the upgrade chain).
What filesystem did you end up using? Ever looked at GlusterFS (http://www.gluster.org/)? From an (admittedly unscientific) comparison of the various free/open-source distributed filesystems, it seems the best to me.
Ceph is still in beta, and the fact that Lustre only supports a single Metadata Server is really scary to me.
One particular part of that slide deck stuck out, and b/c there's no accompanying audio to go with it I thought I'd mention it. Towards the beginning of the deck in one slide he says 'MongoDB is fast.... until your data / indexes no longer fit in memory'. Later on there's the '120GB of data in MongoDB became 70GB of data in Postgres'. The next slide talks about using short key names + a key-mapping layer to preserve space.
While using longer key names in your objects will lead to larger objects / larger storage requirements, indexes do not contain the names of the keys of the documents they are created on, so if I have 10M documents in a collection and an index on a field 'ts', that index has the same size as if the field were named 'timestamp'. It's only the storage of the object that grows in size, and by the time you are handling 120GB of data it's very reasonable to imagine you would be using Memcached in front of MongoDB to obtain the actual objects.
We used verbose key names and paid for it. We even failed to override _id with a useful primary key. So... not great.
That being said, our data is not completely normalized in PostgreSQL, so it's not a totally unfair comparison.
It may be possible to simulate an apples to apples comparison (small keys in mongo + perfectly normalized pgsql tables), but I'm not sure that would be a useful indicator for real world use cases.
Suffice it to say: repeating your schema in every row/document is going to take more space than not repeating it.
Oh no doubt, and I think naming conventions are something that takes practice over time in this situation - too verbose leads to 'wasted' space, too terse makes it near impossible for someone to view the database. Since indexes don't store the key names, that does alleviate part of the problem as well.
I tend to leave the default _id in place as it also acts as a 'created_on' timestamp, saving the need to add an additional field.
Thanks for the deck. It's always good to hear about situations that didn't work out well and what can be learned going forward (though I do think you'd have a better experience using the 1.8 branch than you did with 1.4).
The real shame is that this is still something the programmer has to worry about in the first place.
When you're a database in 2011, you should have a damn good reason for asking programmers to space-optimize their identifiers. I can't see such a reason in Mongo, really.
I vaguely remember 10gen brainstorming some ideas for fixing this, but it's a pretty poor design decision for a database you're advertising for use in "Big Data" scenarios. (Not that our measly 120 GB db constitutes big data... but even at Medium Data something as silly as key sizes can have a significant performance impact).
Some/most MongoDB client libraries can do this for you. I can see arguments both ways for keeping it in the server or keeping it in the client library. 10gen does a hell of a job supporting a wide variety of very good client libraries.
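For illustration, a client-side mapping layer can be as small as this (the mapping and helper names are mine; some drivers/ODMs expose the same idea as field aliases):

    FIELD_MAP = {"timestamp": "ts", "user_id": "u", "message_body": "b"}
    REVERSE_MAP = {v: k for k, v in FIELD_MAP.items()}

    def shrink(doc):
        # Applied just before writing to MongoDB.
        return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

    def expand(doc):
        # Applied to everything read back, so application code sees long names.
        return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

    # collection.insert(shrink({"timestamp": 1301000000, "user_id": 42, "message_body": "hi"}))
    # doc = expand(collection.find_one({"u": 42}))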
Thanks for this. I've heard lots of good things about MongoDB (and auto-sharding makes me drool), but I've heard rumours of problems with it too. Always cool to read something from the other side of the fence.
Too bad that there's no other product out there that makes automatically sharding large amounts of data quite so easy...
Yeah, I've considered other options. Riak seems really nice, actually - I'm going to be trying that out next. I do like MongoDB's support for Windows, though. I'm also hesitant to use Cassandra, considering all the trouble that Reddit's had with it [1].
Reddit's initial problems were from being woefully underprovisioned, which they acknowledged.
Here's what David King [ketralnis] said six months later: "Running any large website is a constant race between scaling your user base and scaling your infrastructure to support it," said David King, Lead Developer at Reddit. "Our traffic more than tripled this year, and the transparent scalability afforded to us by Apache Cassandra is in large part what allowed us to do it on our limited resources. Cassandra v0.7 represents the real-life operations lessons learned from installations like ours and provides further features like column expiration that allow us to scale even more of our infrastructure."
Reddit also just uses it as a dumb K/V store and doesn't take advantage of the column-store model. In theory, all of the comments for a story could be stored as columns in a column family. While this would reduce write throughput to something like ~1000/second for a single story (because partition granularity is at the row level), all the columns in a CF within a row are stored sequentially on disk, and it also means only a single node would need to be interrogated to get the data, rather than having to broadcast a massive multi-get across the entire cluster.
EDIT: the column keys are also returned in sorted order, so this could be used to pre-sort the comment list by its optimal access pattern.
Arguably, with a data set of any serious magnitude, there's no such thing as ad-hoc. As soon as an RDBMS is sharded, it loses most of its ad-hoc powers.
That being said, Riak Search allows Riak to perform ad-hoc queries and Cassandra now supports secondary indices along with a SQL-like query language called CQL. Of course, you'll have to pre-plan your indexes with either of those options.
"As soon as an RDBMS is sharded, it loses most of it's ad-hoc powers."
That's an overly broad statement I would avoid. We only need to query across our various PostgreSQL partitions for offline batch operations. Our application code has yet to feel restricted by the manual sharding.
Secondary indexes - or rather, not restricting a query to a single partition (using your primary key) - are going to be expensive in any distributed system.
You lose aggregates and joins, arguably the most powerful parts of an ad-hoc query system (business intelligence kind of stuff). Without those features, you must move a great deal of these concerns into the application layer (or whatever manages your shards), even if it's just rolling up aggregates across shards. This works, but again, at the loss of the real advantages of the database, as you've got to do it yourself.
I agree with your second assertion. Secondary indexes should be avoided, but are a useful tool if absolutely needed. I've found that Cassandra's richer data model really allows them to be avoided in most scenarios.
I think a huge issue that he somewhat glazed over is that he is refusing to use auto-sharding. That is a pretty large benefit that MongoDB provides. Oh, and he refuses to upgrade his 1 year old build (1.5, now at 1.8) - no explanation of why but he was willing to switch the entire stack to PostgreSQL.
We made the switch before 1.8 was released, and made the decision during the 1.6 cycle, give or take. Auto-sharding never worked reliably in our tests. I think schmichael does a good job explaining the new features that would have helped -- sparse indexes, in particular -- but the core issues still remain (lock contention, no non-blocking online compaction).
For 2.0, there is some major work on locking being done (concurrency will be an ongoing theme in 2011) and online compaction has been implemented (although you have to kick it off, it's not yet automatic).
According to the docs, this is a blocking operation -- so it's in-place, but I wouldn't consider it to be online, as I would PostgreSQL's vacuum or Cassandra's major compactions:
http://www.mongodb.org/display/DOCS/compact+Command
Let's also consider that MongoDB's entire I/O architecture and data structure layout is built around a global lock and in-place updates, which makes implementing non-blocking compaction nearly impossible without either a MAJOR overhaul of the codebase or if somehow they were able to rig up the replication system to internalize the advocated compact-on-replica-set-slave-and-then-rotate method.
PostgreSQL has an automatic cost-based vacuum and sophisticated adaptive free space and visibility maps. With log-then-checkpoint writes and MVCC, there's no interruption of the database at all to reclaim page space used by deleted rows.
Cassandra also has a write-ahead log, and flushes its in-memory representation of data to immutable files on disk, which actually was a pretty brilliant decision. Dead rows are removed during a minor compaction, which can also merge data files together. The compaction is very fast because the files are immutable and in key-sorted order, making it all lock-free, sequential I/O.
In theoryland(tm), in-place updates seem better because you're only writing data once. In realityworld(R), the need for compaction, cost of random I/O, data corruption problems, and locking make it suck.
This is a really fascinating presentation, and an easy read. The major conclusion is to think really carefully about abandoning a SQL database for noSQL. A little more info on the postgres solution and their experience with replication would have been good.
Horizontal scaling is key. If we had more faith in auto-sharding in MongoDB 1.6, that may have worked (however horizontal scaling wasn't our only issue with MongoDB).
With PostgreSQL we manually partition the data (luckily we have a really nice key to partition on).
PostgreSQL 9's streaming replication has been adequate so far. We've only been using it in production for a few months and recently moved off of AWS* which drastically reduced our error rates overall.
Totally off-topic here, but have you noticed that the body text size of your company's blog stays really small even when trying to zoom in? I've tried zooming in using Safari and Chrome, and with both of them the text stays so small that it's difficult to read.
a) Lack of online compaction (causing data set to be much bigger than necessary)
b) Broken auto sharding (complicating horizontal scaling)
c) "Scary" durability and scaling stories with auto sharding
Looks like auto-sharding has had a lot of work done, while online compaction still isn't there.
But in particular I'm curious about (c). What exactly is so scary durability wise with auto-sharding? Is it just the fact that it didn't work in 1.5/1.6?
"But in particular I'm curious about (c). What exactly is so scary durability wise with auto-sharding? Is it just the fact that it didn't work in 1.5/1.6?"
Yes. During the 1.5 dev cycle we ran out of time. Our main database server couldn't vertically scale any further, so we had to do something. Despite 1.6.0 being released before our migration finished, auto-sharding still didn't seem to be anywhere near the stability we needed.
I've heard some horror stories regarding auto-sharding & replica sets (in 1.8) since then that sadly I'm not at liberty to share.
I think you missed a number of finer points as well:
* We were not guided to make the best schema decisions initially, and migrating schemas is tricky and manual.
* Some schema decisions exposed us to shortcomings in MongoDB (the double updating issue).
* The global write lock is severely limiting.
* It's unsafe by default (drivers default to not checking for errors when writing, among other things)
* The ops load is greater for auto-sharding+replica-sets than for manually sharded master/slave setups (granted the former has features the latter does not).
Perhaps the most frustrating part was the constant unfortunate surprises. There are stories that didn't even make it into the slides. Many you could blame on us. Some we could blame on EC2. However, strong benefits to using MongoDB over PgSQL never materialized, so we opted to switch to a system with far fewer unknowns for our fairly modest needs.
If you think fully utilizing MongoDB's architecture will give your business a competitive advantage: go nuts. That was far from our experience.
Thank you, that's very interesting. "Going nuts" with MongoDB is exactly what I'm contemplating right now at my start up, and this is something I haven't quite heard before.
Is there more info on the double update issue anywhere? It sounds like a really bad race condition.
In the PostgreSQL -> MongoDB direction only. While I understand the reasons for implementing this first (nested document replication to an RDBMS is a bigger task), I am hoping that it will see the reverse direction someday.
Using MongoDB as data collectors on the edge, with replication to a central RDBMS for use with established BI tools, would be a win.
Etsy.com unveiled a new gallery list system, the Treasury, last year and published a couple of articles about their use of Mongo. Apparently they had some problems and rewrote the backend recently... haven't heard what yet. Hopefully they'll publish their experiences on their dev blog after the dust settles.
It seems a bit disingenuous to post slides now reviewing a year old version of MongoDB that is almost 3 major revs behind.
These days auto-sharding works and works well. There is a learning curve to optimize it (i.e. picking a good shard key, analogous to picking a good _id field) but that is to be expected.
Replica sets work exactly as advertised in my experience. Automatic fail over and no need to wake up at 3am is a huge win.
Finally, single-server durability directly addresses the durability concerns that many have.
The first two points are the reason why my company and so many others are switching even if that means needing to understand how to pick good ids/shard keys and having to shorten our field names by hand. I can see why, if they were not available at the time, MongoDB might not be a good choice but the fact of the matter is that they are available today.
I think you may be over generalizing here. Yes there are vast similarities between MongoDB and other NoSQL databases but many of the issues he pointed out here are specific to MongoDB (and certain versions of it at that).
If the port to SQL was so easy and used half the storage and didn't appear to impact performance - it was possibly the better solution in the first place.
Mongodb is awesome - one of best technologies I've played with in a long while - but don't be too quick to write SQL off - especially if you have a lot of relationships.
Off topic - prototyping in Rails with Mongoid and the schemaless Mongodb underneath is frictionless - it's a step change in speed from prototyping with SQL.
Would you happen to have any anecdotes about a project or dataset where MongoDB worked particularly well for you?
I'm working on a project where documents seem to work particularly well, though the core of the project is still in MySQL. I decided to give Mongo a shot for what I'm working on, but posts like this are a reminder to constantly be evaluating the technology you choose to use.
I use Mongo for everything but the money (i.e. wherever cash is not involved) and for things not needing transactions. I've been sticking to MySQL for everything else (now moving to XtraDB).
For big big big projects and pure nosql I go with Cassandra (as long as you can feed it enough RAM). Otherwise, XtraDB can go the distance.
I almost regret including that chart because a lot of people are citing it out of context without explanation. That chart illustrates a very extreme case, and it makes an exaggerated point so that people know to be conscious of working set size vs RAM.
FooBarWidget made some very valid points about working set elsewhere in these comments. It's important to note that any database will suffer when data spills out of RAM. The degree to which it suffers varies in different workloads, DBs, and hardware.
Databases like PostgreSQL do better in some cases because they have more granular/robust locking and yielding, so disk access is handled better and generally leads to less degradation in performance.
For example, in MongoDB, when you do a write, the whole DB locks. Additionally, it does not yield the lock while reading a page from disk in the case of an update (this is changing). So if you spill out of RAM, and you update something, now you need to traverse a b-tree, pull in data, and write to a data extent, possibly all against disk, while blocking other operations.
In other databases, they might yield while the data is being fetched from disk for update (or have more granular locks), so a read on an entirely different piece of data can go through. That other piece of data might reside on another disk (imagine RAID setups), so it is able to finish while the write is also going. Beyond that, MVCC DBs can do a read with no read lock at all.
Yes, MongoDB lacks this robustness, and it's why it suffers worse under disk bound workloads than other databases. However, 10gen is very aware of this, and they are making a strong effort to introduce better yielding and locking over the next year.
Regarding indexing and working sets, FooBarWidget points out that there are very naive ways to approach indexing that lead to a much larger working set than is necessary. With very simple tweaks you can have a working set that only includes a sub-tree of the index and smaller portions of data extents. These tweaks apply to many DBs out there.
The classic example is using random hashes or GUIDs for primary ids. If you do this, any new record will end up somewhere entirely arbitrary in the index. This means the whole index is your working set, because placing the new record could traverse any arbitrary path. Imagine instead that you prefix your GUID with time (or use a GUID that has time as the first component). Now your tree will arrange itself according to time, and you will only need the sub-trees for the data that has time ranges you frequently look at. Most workloads only reference recent data, so this helps a ton. You’ll still find yourself doing stuff like this elsewhere in other databases to improve locality.
What’s really interesting is when you look at MVCC, WAL log systems and how they compare. They get a huge win by not needing locks to read things in a committed state. They also convert a lot of ops that involve random IO into sequential IO, but you pay a price somewhere else, usually needing constant compaction on live, query handling nodes. Rbranson elaborated really nicely on this stuff:
Other DBs also win in some ways by managing their own row cache. Letting the OS do it against disk extents means holes in RAM data. If you handle it yourself, you focus on caching actual rows and not deleted data. Cassandra does this.
Once MongoDB addresses compaction, locks, and yielding, it will start to be very competitive.
Keep in mind that I think there's something to be said for having native replica sets and sharding. I can say that it definitely does work in 1.8, we use it all over the place. With the recent journaling improvements, durability is also better, but even without it our experience has been more than adequate. Secondary nodes keep up with our primaries just fine. In the rare case where a primary falls over, we lose maybe only a handful of operations, which for our workload is acceptable.
Overall, we're happy with our choice of MongoDB, it's already doing what we need. With the improvements coming down the road, I think it'll be a major force in a year from now, so I'm also excited for what the future holds.