Care to update on the other stats like requests at peak?
Also, I wonder: if you're at 1b uniques, is there anywhere to grow from here other than more engagement?
Oh, also: Would you say the 1% contribute, 9% curate, 90% consume heuristic holds true in your experience at this scale or is it even more skewed?
Eventually you need abstraction. Python is fast enough, as is Django, that the abstraction costs us less than the value it provides us.
That said, we do manage our partitions (currently) using an ORM layer (at the application level). We do want to break this out into a proxy middleware at some point though.
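To make "managing partitions at the application level" concrete, here is a minimal sketch assuming simple hash-based placement. The shard names, the shard count, and the `shard_for` helper are all hypothetical; the thread does not describe how Disqus actually maps data to partitions. In a Django app, logic like this would usually sit behind a database router.

```python
import zlib

# Hypothetical database aliases; not Disqus's real partition layout.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]

def shard_for(key: str) -> str:
    """Map a partition key (e.g. a thread or site id) to a database alias."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return SHARDS[zlib.crc32(key.encode("utf-8")) % len(SHARDS)]
```

The trade-off of plain modulo placement is that changing the number of shards remaps most keys, which is one reason to move this logic into a dedicated proxy layer later.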
In the newer (2011) slides they mention growth by ~100 million comments in 6 months. Most comments are tiny; I'd be surprised if this whole dataset was much larger than ~100GB.
I don't really understand why Disqus, and Reddit for that matter, don't just switch to Redis. There has got to be a good reason, but I can't think of one. Sounds like you could run the whole thing from just one box and only have slaves for redundancy.
Well I opened with "I don't understand". I'm not saying that they could, I'm saying I don't understand why they can't.
I'd love for you to elaborate. As far as I can tell they are not using PostgreSQL as a relational database, but rather, as a column store. So why not use Cassandra or even Redis (if the amount of data can totally fit into RAM easily, maybe it can't)?
In fact I think Reddit moved to Cassandra... anyway, I am not an expert, I'm asking.
I can't speak much to Reddit. I know they moved some things to Cassandra, but as a user I'll say I haven't been impressed with their latency and uptime since.
As a developer, I can speak a bit about Disqus. I don't speak on their behalf, but I did work there for two years (ironically I was also the first to use Redis there[1]) so I can at least explain why I think using Redis for the whole site is a terrible idea. I'll also note upfront that where I currently work we use a ton of Cassandra, Redis, and very little MySQL on the side, so hopefully I won't be pegged as some kind of "RDBMS only" guy.
Anyway, some reasons:
1) Relational data. Disqus really is relational. You have users, users have posts, posts belong to a thread, threads belong to sites, sites belong to an account (imagine one account for the different CNN websites). And that's a very, very small subset of the number of tables and foreign keys involved. People don't realize how many features Disqus really has above and beyond "post a text blob to a thread."
Being able to write a query that uses joins is huge. The alternative in Redis/Cassandra is having to denormalize your data into every single possible way you may want to do a "query" on later. Oh, by the way, I promise you will forget a few ways and regret having to backfill/fix all the broken denormalizations.
Even if you don't forget to denormalize anything upfront, the biggest joke k/v and document stores ever played on the developer community was convincing them that they save development time by being "schema free". When Disqus wants to add a new feature it's often only a new JOIN/INDEX away. If you realize a year into your Redis deployment that you want to be able to tell a user how many comments they made per month in the year... what do you do? In Postgres you hit the datetime column index and call it a day.
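As a rough illustration of the "comments per month" point, here is a sketch using Python's built-in `sqlite3` as a stand-in for Postgres. The `posts` table, its columns, and the `comments_per_month` helper are assumptions made for the example, not Disqus's schema.

```python
import sqlite3

def comments_per_month(conn, user_id, year):
    """Count one user's comments per calendar month of the given year."""
    start = f"{year:04d}-01-01"
    end = f"{year + 1:04d}-01-01"
    rows = conn.execute(
        """
        SELECT strftime('%m', created_at) AS month, COUNT(*)
        FROM posts
        WHERE user_id = ? AND created_at >= ? AND created_at < ?
        GROUP BY month
        ORDER BY month
        """,
        (user_id, start, end),
    ).fetchall()
    return {int(month): count for month, count in rows}

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
# The index lets the range filter on created_at avoid a full table scan.
conn.execute("CREATE INDEX posts_user_created ON posts (user_id, created_at)")
conn.executemany(
    "INSERT INTO posts (user_id, created_at) VALUES (?, ?)",
    [
        (1, "2012-01-05 10:00:00"),
        (1, "2012-01-20 11:00:00"),
        (1, "2012-03-02 09:30:00"),
        (2, "2012-01-07 08:00:00"),
        (1, "2011-12-31 23:59:59"),
    ],
)

print(comments_per_month(conn, 1, 2012))  # {1: 2, 3: 1}
```

In Postgres the grouping would more likely use `date_trunc('month', created_at)`, but the shape is the same: a new question becomes a new query against an existing index rather than a backfill of a new denormalized structure.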
2) Memory (and cost). The Disqus network is actually pretty huge. Storing the entire dataset in RAM (Redis) would cost a lot more than using an efficient DB like Postgres that is a pro at moving data between disk and RAM. Cassandra would work better than Redis here, but the other problems I list still hold.
Also, as soon as you have to break from one Redis instance to two (either to scale CPU or to live on another box to increase available RAM) you lose a lot of server-side functionality like being able to union sets, or use the embedded Lua to fake 'queries', because now you have keys that live on separate systems. Before anyone says you should shard by "site", see my link below. I did just that, but you have to understand that Disqus is more than just "comments for my website". Say you shard by website: now how do I run a union across sets that involve a single user who has posted to 100 different websites? I can't. Back to backfilling and denormalizing tons of data that also needs to be resident in RAM and kept in sync.
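The cross-shard problem can be sketched in a few lines. Here plain dicts stand in for separate Redis instances, and the key names and the `union_across_shards` helper are hypothetical. On a single instance, Redis can union sets server-side with `SUNION`; once the sets live on different instances, the client has to fetch from each one and merge the results itself.

```python
# Each dict stands in for one Redis instance (one shard per site).
# Keys and values are made up for illustration.
shard_site_a = {"user:42:posts": {"a1", "a2"}}
shard_site_b = {"user:42:posts": {"b1"}, "user:7:posts": {"b9"}}

def union_across_shards(shards, key):
    """Client-side union: one lookup per shard, merged in the application."""
    result = set()
    for shard in shards:
        result |= shard.get(key, set())
    return result

all_posts = union_across_shards([shard_site_a, shard_site_b], "user:42:posts")
print(sorted(all_posts))  # ['a1', 'a2', 'b1']
```

With 100 sites that is 100 round trips per query, which is what pushes you toward maintaining a second, per-user denormalized copy of the data.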
---
I could add more, but I just realized that the linked talk probably spoke about the big sharded Postgres K/V type store that they built. Here's the thing: all of the core stuff (like from point 1) isn't stored in there. It's used when it can be, for scalability, but the majority of the app is still in a behemoth Postgres instance that is replicated many times over. As to why not use Redis for that part? I'd say because it's memory only and because Disqus has Postgres expertise. Also, it's not truly "key value" because it still has indexes for say, datetimes or post_id or site_id which make doing a lot of non-relational queries handy without having to denormalize. Now, why not use Cassandra for that? Well, I would. :)
So when I said switch to Redis I meant to replace the 'big sharded Postgres K/V type store that they use' not the part where they actually use relational features of the database.
I'm always curious about the idea of scaling UP versus OUT -- like you mention, going from one Redis instance to two muddies the waters. So why do it at all? (Maybe a year from now Redis Cluster will finally come out and solve this)
1TB of RAM is going to dip below five figures soon. I guess if you can't fit into that, it is moot.
"If you realize a year into your Redis deployment that you want to be able to tell a user how many comments they made per month in the year... what do you do? In Postgres you hit the datetime column index and call it a day."
I would do bloom filters, but point very well taken. No silver bullets. Thanks again for the reply.
Disqus wouldn't fit into 1TB of memory as a denormalized data set.
It doesn't even fit (indexed, at least) into 1TB of memory as a normalized data set.
At the scale we're at, you're required to make tradeoffs and come up with less than standard solutions to problems. Our solution, as many others have done before us, is to shard datasets (both Redis and SQL).
Congrats on the growth, that's a lot of comments! What I really want to know is how you guys solved being Googlebot friendly -- I have a foggy memory of a blog post or HN comment from around the time of the new version coming out that said there was something interesting that would be shared about that in the future.
Disqus is currently using some Redis AFAIK, but probably not to store the bulk comment data (and probably for good reasons, since I can imagine an enormous difference between working set and total data set in the case of Disqus).
What struck me instantly was the use of Slony. I haven't listened to the whole thing yet but I am interested in their justification. Perhaps they just haven't moved to 9.2 yet.
What is a distributed commenting system? An HTML script tag to load a static file that will ping an api that will serve you comments.
What is django? A framework with strong routing, OR-mapping, templates and admin modules.
You need none to build it. Your main concern shouldn't be a coding framework, but load balancing, caching, failover control, more sysadmin stuff than code.
Any language could do. Framework? not needed.
Now, for the backend system to control that monster, then yes, you may use django.
So, use django for complex apps that need routing, data management and UI presentation, plus a powerful admin module.
> What is a distributed commenting system? An HTML script tag to load a static file that will ping an api that will serve you comments.
This is how Disqus works. Something still has to power the API that serves the data for it and we happen to like Django a lot so that's what we use. There are also a lot of parts of Disqus that are not the embed (moderation panel, account management, etc.)
Exactly, that's my point. Not meant to demerit your great work.
The admin/crud stuff, moderation panel, account management, etc, that's what django was developed for. Great choice.
If you asked me to start Disqus from scratch again, I'd probably use django too, for the admin part, but for the API I'd go commando, closer to the metal, without a framework at all. No need to load a 100MB routing/modeling/templating monster just to perform an invisible API call.
Highly optimized plain python scripts would work better.
Sometimes frameworks are more like handcuffs. Just sometimes.
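For reference, the "no framework" API described above could look roughly like the bare WSGI callable below. This is only a sketch of the idea, not how Disqus serves its API (the reply above says that is Django); the in-memory `COMMENTS` data and the URL scheme are invented for the example.

```python
import json

# Hypothetical in-memory data; a real service would read from a cache or database.
COMMENTS = {"thread-1": [{"author": "alice", "text": "First!"}]}

def app(environ, start_response):
    """Bare WSGI application: GET /<thread_id> returns that thread's comments as JSON."""
    # Treat the URL path (minus the leading slash) as the thread id.
    thread_id = environ.get("PATH_INFO", "/").lstrip("/")
    comments = COMMENTS.get(thread_id)
    status = "200 OK" if comments is not None else "404 Not Found"
    body = json.dumps({"comments": comments or []}).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

It can be served with the standard library's `wsgiref.simple_server.make_server` for testing, or under uwsgi in production. What it leaves out (authentication, rate limiting, validation, error handling) is roughly what a framework would otherwise provide.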
At least they haven't translated the slug into Russian. Every time I get email from them about a comment on my post, it shows up translated into Russian.
* We use Flask and nginx in various areas now (the main app is still Django). Our realtime app, for example, is powered off of uwsgi and Flask.
* There are nearly 1b monthly uniques across the network serviced by the platform.
* ~300 servers
* Still Postgres (with Slony, multiple clusters), Redis, Memcache, and some Cassandra for newer things (not comments).
Also mostly confident we're still the "largest Django app" in terms of traffic.