Disqus: Scaling the World’s Largest Django Application (2010) [video]

zeeg · on Nov 22, 2012

As these slides are very old, here are some updates:

* We use Flask and nginx in various areas now (the main app is still Django). Our realtime app, for example, is powered off of uwsgi and Flask.

* There are nearly 1b monthly uniques across the network serviced by the platform.

* ~300 servers

* Still Postgres (with Slony, multiple clusters), Redis, Memcache, and some Cassandra for newer things (not comments).

Also mostly confident we're still the "largest Django app" in terms of traffic.

recuter · on Nov 22, 2012

Interesting.

Care to update on the other stats like requests at peak? Also, I wonder - if you're at 1b uniques is there anywhere to grow from here other than more engagement?

Oh, also: Would you say the 1% contribute, 9% curate, 90% consume heuristic holds true in your experience at this scale or is it even more skewed?

reinhardt · on Nov 22, 2012

Virtual servers or bare metal? What do you use for herding them?

_qcmz · on Nov 22, 2012

I second that. Would be interesting to know a bit more about the systems setup.

Also why the decision to use an ORM? To manage the partitions?

zeeg · on Nov 22, 2012

Eventually you need abstraction. Python is fast enough, as is Django, that the abstraction costs us less than the value it provides us.

That said, we do manage our partitions (currently) using an ORM layer (at the application level). We do want to break this out into a proxy middleware at some point though.

_qcmz · on Nov 24, 2012

Right, just interested in the choice of such an anti-pattern.

What proxies are you considering?

lewispb · on Nov 22, 2012

Largest? Is Disqus really 'larger' than Instagram?

simonw · on Nov 22, 2012

I think it was in 2010 when they gave this talk.

saevarom · on Nov 22, 2012

Yeah these figures are really outdated. This one from PyCon 2011 is newer: http://www.slideshare.net/zeeg/pycon-2011-scaling-disqus-725...

tangue · on Nov 22, 2012

Interesting: according to these slides [14], there is no difference between Apache + mod_wsgi and Nginx + uWSGI.

mibbitier · on Nov 22, 2012

Also "large" is fairly meaningless when you don't include a measure of complexity.

Sharing images is a fairly simple task.

zerop · on Nov 22, 2012

All with apache and haproxy? No nginx or uwsgi/gunicorn or Redis? How old is this article?

recuter · on Nov 22, 2012

In the newer (2011) slides they mention growth by ~100 Million comments in 6 months. Most comments are tiny, I'd be surprised if this whole dataset was much larger than ~100GB.

I don't really understand why Disqus and Reddit for that matter don't just switch to redis. There has got to be a good reason because I can't think of one. Sounds like you could run the whole thing from just one box and only have slaves for redundancy.

Why isn't vertical scaling more in vogue?

bretthoerner · on Nov 22, 2012

If you really think that's the case then you don't understand Disqus/Reddit OR Redis.

recuter · on Nov 22, 2012

Well I opened with "I don't understand". I'm not saying that they could, I'm saying I don't understand why they can't.

I'd love for you to elaborate. As far as I can tell they are not using PostgreSQL as a relational database, but rather, as a column store. So why not use Cassandra or even Redis (if the amount of data can totally fit into RAM easily, maybe it can't)?

In fact I think Reddit moved to Cassandra... anyway, I am not an expert, I'm asking.

bretthoerner · on Nov 22, 2012

I can't speak much to Reddit. I know they moved some things to Cassandra, but as a user I'll say I haven't been impressed with their latency and uptime since.

As a developer, I can speak a bit about Disqus. I don't speak on their behalf, but I did work there for two years (ironically I was also the first to use Redis there[1]) so I can at least explain why I think using Redis for the whole site is a terrible idea. I'll also note upfront that where I currently work we use a ton of Cassandra, Redis, and very little MySQL on the side, so hopefully I won't be pegged as some kind of "RDBMS only" guy.

Anyway, some reasons:

1) Relational data. Disqus really is relational. You have users, users have posts, posts belong to a thread, threads belong to sites, sites belong to an account (imagine one account for the different CNN websites). And that's a very, very small subset of the number of tables and foreign keys involved. People don't realize how many features Disqus really has above and beyond "post a text blob to a thread."

Being able to write a query that uses joins is huge. The alternative in Redis/Cassandra is having to denormalize your data into every single possible way you may want to do a "query" on later. Oh, by the way, I promise you will forget a few ways and regret having to backfill/fix all the broken denormalizations.

Even if you don't forget to denormalize anything upfront, the biggest joke k/v and document stores ever played on the developer community was convincing them that they save development time by being "schema free". When Disqus wants to add a new feature it's often only a new JOIN/INDEX away. If you realize a year into your Redis deployment that you want to be able to tell a user how many comments they made per month in the year... what do you do? In Postgres you hit the datetime column index and call it a day.

2) Memory (and cost). The Disqus network is actually pretty huge. Storing the entire dataset in RAM (Redis) would cost a lot more than using an efficient DB like Postgres that is a pro at moving data between disk and RAM. Cassandra would work better than Redis here, but the other problems I list still hold.

Also, as soon as you have to break from one Redis instance to two (either to scale CPU or to live on another box to increase available RAM) you lose a lot of server-side functionality like being able to union sets, or use the embedded Lua to fake 'queries' because now you have keys that live on separate systems. Before anyone says you should shard by "site", see my link below. I did just that, but you have to understand that Disqus is more than just "comments for my website". Say you shard by website, now how do I run a union across sets that involve a single user who has posted to 100 different websites? I can't. Back to backfilling and denormalizing tons of data that also needs to be resident in RAM and kept in sync.
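
To make point 1 concrete, here is a minimal sketch of the "comments per month" query. It is only an illustration: in-memory SQLite stands in for Postgres, and the `users`/`posts` tables, their columns, and the sample rows are all hypothetical, not the real Disqus schema. In Postgres the month bucket would more likely come from `date_trunc('month', created_at)`.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        created_at TEXT NOT NULL  -- ISO 8601 timestamp
    );
    -- The index that makes the "new feature" cheap to add later.
    CREATE INDEX posts_user_created ON posts (user_id, created_at);
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.executemany(
    "INSERT INTO posts (user_id, created_at) VALUES (?, ?)",
    [
        (1, "2011-12-31T23:59:59"),
        (1, "2012-01-05T10:00:00"),
        (1, "2012-01-20T12:30:00"),
        (1, "2012-03-02T08:15:00"),
    ],
)

def comments_per_month(conn, user_name, year):
    """Count one user's posts per month of `year` using a join and GROUP BY."""
    return conn.execute(
        """
        SELECT strftime('%Y-%m', p.created_at) AS month, COUNT(*) AS n
        FROM posts AS p
        JOIN users AS u ON u.id = p.user_id
        WHERE u.name = ?
          AND p.created_at >= ? AND p.created_at < ?
        GROUP BY month
        ORDER BY month
        """,
        (user_name, f"{year}-01-01", f"{year + 1}-01-01"),
    ).fetchall()

monthly = comments_per_month(conn, "alice", 2012)
```

No data had to be reshaped to answer the new question; in a pure key/value layout the same answer would need a per-user, per-month counter that was maintained from day one or backfilled.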

---

I could add more, but I just realized that the linked talk probably spoke about the big sharded Postgres K/V type store that they built. Here's the thing: all of the core stuff (like from point 1) isn't stored in there. It's used when it can be, for scalability, but the majority of the app is still in a behemoth Postgres instance that is replicated many times over. As to why not use Redis for that part? I'd say because it's memory only and because Disqus has Postgres expertise. Also, it's not truly "key value" because it still has indexes for say, datetimes or post_id or site_id which make doing a lot of non-relational queries handy without having to denormalize. Now, why not use Cassandra for that? Well, I would. :)

[1] https://github.com/bretthoerner/blog/blob/master/2011/2/21/r...

recuter · on Nov 22, 2012

Ah, terrific response, thanks very much. :)

So when I said switch to Redis I meant to replace the 'big sharded Postgres K/V type store that they use' not the part where they actually use relational features of the database.

I'm always curious about the idea of scaling UP versus OUT -- like you mention, going from one Redis instance to two mucks up the waters. So why do it at all? (Maybe a year from now Redis Cluster will finally come out and solve this)

1TB of RAM is going to dip below five figures soon, I guess if you can't fit into that it is moot.

"If you realize a year into your Redis deployment that you want to be able to tell a user how many comments they made per month in the year... what do you do? In Postgres you hit the datetime column index and call it a day."

I would do bloom filters, but point very well taken. No silver bullets. Thanks again for the reply.

zeeg · on Nov 22, 2012

Disqus wouldn't fit into 1TB of memory as a denormalized data set.

It doesn't even fit (indexed, at least) into 1TB of memory as a normalized data set.

At the scale we're at, you're required to make tradeoffs and come up with less than standard solutions to problems. Our solution, as many others have done before us, is to shard datasets (both Redis and SQL).
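
A rough sketch of what sharding by key implies for the set unions discussed upthread. Everything here is an assumption for illustration: plain dicts stand in for separate Redis instances, and the shard count, key format, and function names are hypothetical rather than anything Disqus is known to run.

```python
import zlib
from collections import defaultdict
from typing import Iterable, Set

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(key: str) -> int:
    """Pick a shard with a stable hash (crc32) so routing is deterministic."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS

# Each dict stands in for one Redis instance: key -> set of post ids.
shards = [defaultdict(set) for _ in range(NUM_SHARDS)]

def add_post(site: str, user: str, post_id: int) -> None:
    """Store a post id in a per-site, per-user set on the site's shard."""
    shards[shard_for(site)][f"posts:{site}:{user}"].add(post_id)

def posts_by_user(user: str, sites: Iterable[str]) -> Set[int]:
    """Union one user's posts across sites by querying each site's shard.

    The union happens in the application, one lookup per site, because no
    single shard can see all of the sets involved.
    """
    result: Set[int] = set()
    for site in sites:
        result |= shards[shard_for(site)].get(f"posts:{site}:{user}", set())
    return result

add_post("cnn.example", "alice", 1)
add_post("blog.example", "alice", 2)
add_post("cnn.example", "bob", 3)
alice_posts = posts_by_user("alice", ["cnn.example", "blog.example"])
```

Sharding by site keeps each thread's data together, but a per-user view turns into a fan-out over every site the user has posted on, which is the tradeoff described above.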

recuter · on Nov 22, 2012

Congrats on the growth, that's a lot of comments! What I really want to know is how you guys solved being Googlebot-friendly -- I have a foggy memory of a blog post or HN comment from around the time of the new version coming out that said there was something interesting that would be shared about that in the future.

zeeg · on Nov 22, 2012

All I can really say (not being on the Google side of things) is: iframes

lamby · on Nov 22, 2012

Because people like things like transactions and are generally risk-averse when choosing storage technologies?

antirez · on Nov 22, 2012

Disqus is currently using some Redis AFAIK, but probably not to store the bulk comment data (and probably for good reasons, since I can imagine an enormous difference between working set and total data set in the case of Disqus).

ihsw · on Nov 22, 2012

Sept. 7 2010, according to their slideshare link.

http://www.slideshare.net/zeeg/djangocon-2010-scaling-disqus...

hahainternet · on Nov 22, 2012

What struck me instantly was the use of Slony. I haven't listened to the whole thing yet but I am interested in their justification. Perhaps they just haven't moved to 9.2 yet.

sugarcode · on Nov 22, 2012

Slony offers some compelling advantages over streaming replication - even on new databases we set up, we still like Slony for several reasons:

* Version upgrades (streaming replication requires the same PG version between master/slave, slony does not)

* Logical replication gives us finer-grained control over how data is replicated across the cluster

* Ability to create additional indexes on slaves

Slony isn't perfect and it has caused us some headaches, but its flexibility makes it our go-to replication tool for Postgres.

d0ugal · on Nov 22, 2012

I don't think they could move to 9.2 in 2010 ;)

Kilimanjaro · on Nov 22, 2012

Using django for a distributed commenting system?

Hammer and screws.

legutierr · on Nov 22, 2012

Just wondering (I'm not necessarily objecting to your statement):

* what would you see as the right type of app for Django, if not this? Obviously Django is handling it, so what's the issue?

* what is the right tool to build a distributed comment system?

Kilimanjaro · on Nov 22, 2012

What is a distributed commenting system? An HTML script tag to load a static file that will ping an api that will serve you comments.

What is django? A framework with strong routing, OR-mapping, templates and admin modules.

You need none to build it. Your main concern shouldn't be a coding framework, but load balancing, caching, failover control, more sysadmin stuff than code.

Any language could do. Framework? not needed.

Now, for the backend system to control that monster, then yes, you may use django.

So, use django for complex apps that need routing, data management and UI presentation, plus a powerful admin module.

* I am a python/django developer.

tkaemming · on Nov 22, 2012

> What is a distributed commenting system? An HTML script tag to load a static file that will ping an api that will serve you comments.

This is how Disqus works. Something still has to power the API that serves the data for it and we happen to like Django a lot so that's what we use. There are also a lot of parts of Disqus that are not the embed (moderation panel, account management, etc.)

Kilimanjaro · on Nov 22, 2012

Exactly, that's my point. Not meant to demerit your great work.

The admin/crud stuff, moderation panel, account management, etc, that's what django was developed for. Great choice.

If you ask me to start Disqus from scratch again, I'd probably use django too, for the admin part, but for the API I'd go commando, closer to the metal, without framework at all. No need to load a 100MB routing/modeling/templating monster just to perform an invisible API call.

Highly optimized plain python scripts would work better.

Sometimes frameworks are more like handcuffs. Just sometimes.
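
For illustration only, a framework-free endpoint of the kind described here could look roughly like the sketch below. It uses nothing but the standard library and the WSGI calling convention; the `COMMENTS` data, the `thread` query parameter, and the response shape are hypothetical and not how Disqus's API actually works.

```python
import json
from urllib.parse import parse_qs

# Hypothetical in-memory data; a real service would read from a cache or database.
COMMENTS = {"thread-1": [{"author": "alice", "text": "First!"}]}

def app(environ, start_response):
    """Bare WSGI callable: return the comments for ?thread=<id> as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    thread_id = query.get("thread", [""])[0]
    comments = COMMENTS.get(thread_id)
    if comments is None:
        status, payload = "404 Not Found", {"error": "unknown thread"}
    else:
        status, payload = "200 OK", {"thread": thread_id, "comments": comments}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Any WSGI server can host this callable, for example `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` for local testing. Authentication, rate limiting, and error handling are left out, and those are the pieces a framework's middleware would otherwise supply.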

zeeg · on Nov 22, 2012

We have never ever used the Django admin, and I would choose Django any day of the week for a project that is on the web and using a database.

(In fact almost every single project I've ever built has used Django, and there's never been a limiting factor of that choice)

wahnfrieden · on Nov 22, 2012

It's pretty easy to implement yourself but Django's middleware helps a lot too even with APIs.

riffraff · on Nov 22, 2012

how is disqus distributed? All the comments go to a central service.

stef25 · on Nov 22, 2012

Painful spam in the disqus comments at the bottom of that page. Don't they have something in place against this?

beaumartinez · on Nov 22, 2012

"+1"s are hardly spam, they're just low quality comments. If they filtered those, the web would be a very commentless place

d0ugal · on Nov 22, 2012

kordless · on Nov 22, 2012

At least they haven't translated the slug into Russian. Every time I get email from them about a comment on my post, it shows up translated into Russian.

TommyDANGerous · on Nov 22, 2012

Great watch.