MongoDB Finds A Major Adopter In Craigslist (dzone.com)
95 points by DanielRibeiro on May 16, 2011 | hide | past | favorite | 21 comments




This video is a presentation by Jeremy Zawodny, who co-wrote O'Reilly's "High Performance MySQL" ( http://oreilly.com/catalog/9780596101718 ). If there's anybody worth watching a video about switching from MySQL to MongoDB, it's that guy.


For a direct link to the 34 min video: http://www.10gen.com/video/mongosv2010/craigslist


A bit off topic, but since Jeremy seems to be in here, I've been wanting to thank you for High Performance MySQL - it's one of the best technical books I've read in recent years, with a lot of insights that had been hard/impossible to piece together definitively by reading online articles/docs alone. It has taken a lot of uncertainty out of things for me as a dabbling db admin trying to scale.


It's mind-blowing to read that they could get data into MongoDB faster than they could get it out of MySQL. I wonder whether their schema was that complex, or the queries were, or whether MongoDB is simply that fast. Anyway, it's great to see sites like this embracing new technologies and revealing the details.

[edit] I guess watching the video explains a lot more. Do watch!


Yes, it is a bit mind-blowing. The only DB boxes I can consistently get data out of at a high rate are those with Fusion-io cards in them. And we're not heavily normalized--just a handful of tables.


I recently went to a Google Tech Talk on MongoDB, and although the presentation wasn't very hacker oriented, I must say the setup of replication sets and sharding is ridiculously easy (http://www.mongodb.org/display/DOCS/Simple+Initial+Sharding+...). I started up a project using CouchDB and Redis, but was impressed enough with Mongo's scalability/ease of use that I think I might switch over to MongoDB for a bit.
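For anyone who hasn't seen it, the setup the linked docs describe really is short. This is a hedged sketch of the mongo shell commands involved, not the exact steps from that page; the hostnames, port, database name ("ads"), and shard key ("posted_at") are all made-up placeholders:

```javascript
// Connect to a mongos router, then register each shard
// (hostnames below are hypothetical examples).
sh.addShard("shard0.example.com:27017")
sh.addShard("shard1.example.com:27017")

// Enable sharding on a database, then pick a shard key for a collection.
sh.enableSharding("ads")
sh.shardCollection("ads.listings", { posted_at: 1 })
```

From there the balancer distributes chunks across shards on its own, which is a big part of why the setup feels so easy compared with hand-rolled MySQL partitioning.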


The big tests for Mongo are how well it behaves under heavy load, how easy it is to shard/replicate data at large scale (e.g. what Zawodny was talking about in his presentation about Craigslist having 100 MySQL boxes), and to what degree data is recoverable when inevitable failures happen (there are some big questions here with Mongo since it appears it trades off ACID compliance for speed).

It's great to see bigger users using Mongo because that's where these tests really take place. For example, it seems Cassandra got tested this way at Facebook/Reddit/Digg etc. and didn't cut it.


Pretty much all distributed databases trade off ACID for speed; it's a fundamental tension in distributed database design.

If you've got 20 database servers, potentially on opposite sides of the planet, and you insist on ACID compliance throughout the cluster, each write commit could be delayed by 500ms or more.

If you're running a pair of servers in the same room, as is normal for a traditional "active-active" database cluster, a network delay of 1ms rarely has a significant impact.
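The hundreds-of-milliseconds figure is easy to sanity-check from first principles. This is a back-of-the-envelope sketch with assumed round numbers (fiber speed, antipodal distance, two round trips for a two-phase commit), not a measurement:

```python
# Rough check on why a globally synchronous commit costs hundreds of ms.
# All constants below are ballpark assumptions, not measured values.
SPEED_IN_FIBER_KM_S = 200_000   # light in fiber: roughly 2/3 of c
ANTIPODAL_FIBER_KM = 20_000     # ~half of Earth's circumference

one_way_ms = ANTIPODAL_FIBER_KM / SPEED_IN_FIBER_KM_S * 1000
round_trip_ms = 2 * one_way_ms

# A two-phase commit needs at least two round trips (prepare, then commit),
# so a fully synchronous cross-planet write pays this much in pure
# propagation delay, before queuing, disk flushes, or retransmits.
two_phase_commit_ms = 2 * round_trip_ms
print(two_phase_commit_ms)  # 400.0
```

Add real-world overhead on top of that 400ms floor and the "500ms or more" figure above is entirely plausible.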


See http://mongodb-is-web-scale.com/ for a humorous angle on the speed vs. ACID compliance tradeoff.


Not to take away from the substance of the article itself, but is anyone else surprised that they have 2 billion "documents", which presumably means active ads/listings? That seems like an awful lot.


MongoDB is being used for historical archiving, not for the live site itself. The big reason is that changing table schemas for very large sets of old data is painful with MySQL. So the 2 billion number would be any ad/listing older than a set amount of time.

The live data is < 1 TB and is still stored in MySQL.


Exactly.

The "set amount of time" typically hovers around 60 days, though our archiving process has been off for several months while the migration took place. So we have some catching up to do--somewhere in the neighborhood of 150 million postings, last I counted.


I've been hearing some good things about Riak lately and their masterless implementation seems quite interesting. Did Riak ever make your radar and, if so, what were the disadvantages that made you choose MongoDB?

Were I to guess based on the video, I'd say the lack of a Perl client, and that you'd probably end up having to roll too many of your own solutions on top of it.


I would have expected more, personally. Craigslist is massive, popular, and has been around a long time. That adds up to a TON of listings.


~2.2 billion is a ton of listings. What you have to realize is that craigslist wasn't in hundreds of cities on day #1. In recent years we've had tens of millions of "live" ads on the site, but it took a while to grow to that size.


This looks like data warehousing of the archive. The two billion listings probably represent all expired ads ever. There is no way they have 2 billion active ads at any one time.


The above comment is correct.

The archive does have to be accessed by users though, since users can access listings from many years back.

The entire archive seems to be under 4 TB from what he described in the video (2 billion documents at 2 kilobytes each). They do not retain photos.
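The arithmetic behind that estimate checks out. A quick sketch, taking the video's two round numbers (2 billion documents, ~2 KB each) at face value:

```python
# Sanity-check the "under 4 TB" archive estimate from the video.
# Both inputs are the approximations quoted there, not exact counts.
documents = 2_000_000_000
bytes_per_doc = 2 * 1024            # ~2 KiB per archived posting

total_bytes = documents * bytes_per_doc
total_tib = total_bytes / 1024**4   # convert to tebibytes

print(round(total_tib, 2))  # 3.73
```

So roughly 3.7 TiB: comfortably "under 4 TB", and small enough to make the no-photos policy the obvious reason the archive stays manageable.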


Yup. You hit the nail on the head.


How much photo data do you handle? How long do you keep it?


The photos are removed once the posting is no longer live on the site (roughly). As for how many, I'd have to dig a bit to find that out...



