Why does digg need so many servers? (twitter.com/spolsky)
64 points by latch on Oct 13, 2010 | 61 comments



Just as perplexing: why do they need ~68 employees (http://about.digg.com/team)?

My counting may be off by 1-2, but last I checked Reddit is coming up on them traffic-wise (http://siteanalytics.compete.com/reddit.com+digg.com/), and only has 7 employees? And I'm pretty sure Reddit doesn't have 500 servers either, although I could be wrong.

Both sites offer very similar functionality.

While past its peak, Slashdot only has 18 servers. (http://slashdot.org/faq/tech.shtml#te050)


My understanding (which is, to say, belief built up over watching the companies from outside) is that Reddit is currently running what could best be described as a skeleton crew: Conde Nast won't fund them, so they're just trying to keep the ship afloat and don't have time for anything else.

Digg has scaled to the size of a company that, for a couple of years now, has considered itself just on the tipping point of "something big happening." As in there being a sea change in the way everyone consumes their news, and Digg is at the center. You'll need lots of employees when that happens, right? I think Digg has partly grown fat due to non-existent leadership; the struggle between Jay Adelson and Kevin Rose has been well documented, and having your two lead guys check out for at least a year is not a good way to run a business. A lot of the people there are "sales", which I guess helps Digg stay profitable, but I don't really think you need as many as they have. Five community managers on a community that is supposed to manage itself is also excessive.

I expect the truth of the matter lies somewhere in the middle of Reddit and Digg. More than Reddit so you're not stagnant, less than Digg so you're not fat.


Everything that is cool about Reddit at this point is what Redditors are doing with Reddit, not what Reddit is doing with its site. And isn't that how social media is supposed to work? Not just some notion that users are tools, that you'll crowdsource your way to being yet-another-traditional-media source (except without journalists)... social media should be individuals forming their own communities... communities of interest, communities of practice, communities of support -- all of which exist in Reddit (/r/SuicideWatch both disturbs and impresses me).

So the fact Reddit-the-software isn't changing that much doesn't seem like that big an issue, it's more like infrastructure. A major redesign would be negatively disruptive to the communities that are building there. Not that there aren't great things Reddit could do but isn't, but I think they are doing well at the most important stuff. If Reddit had 68 employees they not only would be fat, they'd probably fuck up a good thing.


Do not take my comment to mean I'm anti-Reddit, I like Reddit way more than I ever did Digg :) But Reddit doesn't really do a lot apart from keeping everything going.

I agree that altering Reddit now is probably not a good idea, and they don't need a large headcount, but it's good that Reddit isn't answering to shareholders directly, because if it wasn't for Digg's implosion, their growth would have been far too slow.


But that's the funny thing: Reddit is growing and Digg is shrinking, despite their staffing. Reddit does a much better job than Digg at allowing the community to manage itself on a volunteer basis, through the moderators/admins of the subreddits.


Yes, it blows my mind that digg needs so many employees. Reddit actually is larger than digg (http://www.readwriteweb.com/archives/reddit_to_mainstream_me...), and until recently had even fewer than 7 employees. If I'm not mistaken they have 3 programmers and a systems admin. Kind of makes you wonder what all the digg employees do all day.


"Kind of makes you wonder what all the digg employees do all day."

Communication. I'm always surprised that so few people who do software (I'm talking management at companies) are familiar with the mythical man month. Adding employees to a software project always yields diminishing returns due to the increase in communication required.
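To put a rough number on that: the usual illustration from The Mythical Man-Month counts pairwise communication channels, n(n-1)/2 for a team of n. Here is a minimal sketch using the headcounts quoted in this thread (7 and roughly 68). The function name is just for illustration, and real companies obviously don't have everyone talking to everyone, so treat this as an upper bound on pairs, not a measurement.

```python
def communication_channels(team_size: int) -> int:
    """Number of distinct pairs in a team of the given size: n * (n - 1) / 2."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


# Headcounts as quoted in the thread (approximate, not verified).
for name, headcount in [("Reddit", 7), ("Digg", 68)]:
    pairs = communication_channels(headcount)
    print(f"{name}: {headcount} people, {pairs} possible pairs")
```

Roughly ten times the people gives roughly a hundred times the possible pairs, which is the diminishing-returns argument in miniature.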


My guess is that Reddit won't be able to innovate much with that level of staffing, they are basically just keeping it running - but maybe that's ok, since they have already been bought. Slashdot has been the same for a long time too, but they are also not looking for a big exit.

Digg has had $40m in funding [http://www.crunchbase.com/company/digg] and so unless it sells for $200m+ the investors will most likely see it as a failure. Kevin Rose seems to be advising people to take less money [http://www.zdnet.com/blog/weblife/kevin-rose-10-tips-for-ent...] or no money, presumably because Digg was over-funded, at least for its current position.


Absolutely. Every time someone trots out "And Reddit only has 7 guys", I think to myself that they're innovating as if they had 3. Say what you will about Digg, but at least they rolled out new software.


Say what you will about Digg, but at least they rolled out new software.

Digg is a discussion site, not an operating system -- what is the value in "rolling out new software"?

Digg's value is in the quality of the discussion and community; "innovation" that doesn't improve that is pointless. Look at HN: it remains nearly as minimalist as always, and mostly receives occasional tweaks and performance improvements.


Quite -- it was innovation that killed Digg wasn't it?


I'm of the view that the growth and community killed Digg.

Once Digg hit critical mass, they were killed by the signal to noise ratio. I stopped visiting the site a couple of years ago because of that.

The reason why HN thrives is that the signal to noise ratio is still quite high, and while there is diversity in the submissions, you can probably group them into a few common themes.


Sure, you can argue that they would have been better off not rolling out new software, or that they don't need new features. I'm just saying that the answer to the (sometimes implicit) question of "why does Digg have so many more people than Reddit" is that some of them are designing features and writing code, and there doesn't seem to be much of that going on at Reddit.

Personally, I'm in general agreement with you. And you can rebase the whole argument: take MetaFilter, with 4 people on the payroll (and not all technical). What is everyone at Reddit doing?


I would agree that I love HN's minimalist approach, but the end game for HN and Digg are vastly different, aren't they?

I think the perceived value of rolling out new software is that Digg had hoped it would draw in more users and entice existing users to become more active. If you're running a website that is largely user-generated content and is also stagnant, how do you inject new life into it without adding new features? I'm genuinely curious.


Where is Kevin advising people to take less money? If you mean the $80 million comment, that was almost certainly before the Series C, so digg had less than half the funding.



For someone who took $ x million "off the table," during those funding rounds the hypocrisy is impressive.


Since when did "experience" == "hypocrisy"?


Huh? Kevin sold several million dollars worth of stock, enabled by raising more money than needed, and gave up more control to do it. Now he advocates that others raise as little money as possible. How is that "experience?"


reddit uses something like 250-300 EC2 nodes.


Unless they are using the really big instances though, I find that EC2 nodes are generally lower in power compared to what most people are using for their servers.

I remember not too long ago before they moved to EC2 everything fit in just a few racks. Of course, their traffic has expanded significantly recently.


EC2 nodes are terribly underpowered, particularly in the I/O realm (the biggest weakness of virtually every site) where a standard desktop absolutely annihilates the I/O performance of an S3 connection (where you have to perform software RAID to get anywhere near the performance of a single magnetic low-end SATA drive, which is absolutely terrible).

A lot of the "web scale" noise has come from dealing with these gerbil sized instances.


The difference is that digg has network information (i.e., followers). Stack Overflow doesn't.

Think through the ramifications of each of these page flows:

1. Give me this question and everybody who commented on it.

2. Give me every post that everyone this user is following has made over the last n days.

That's why.


I think this is exactly it. You can't compare a comment thread on Digg to a Question on SO.


That would increase the load on the database. Why does that increase the load on the webservers?


It depends how you approach the problem. I'm guessing here, from my own experience implementing social graph features, but here goes:

There are a couple of ways you can go about this. The first is the database: join the network of people against the land of content and bring it back. This doesn't work; your database will cry. (As an aside, this is what people mean when they say web scale: it has nothing to do with web traffic, it's social graphs.) Although not at first: in development it works fine, and you feel fine, and for a while you're ok, but you start growing....

Another way you can go about it is by denormalizing. In this world you store a pointer to each content item for each user. So anytime I do something all the people [following|watch|connected|friended] to me get a record indicating I did this. This works, but now you have lots of data (lots and lots of data!) spread all crazy around. You need some kind of system to push that data out to everybody. It's those last two that drive up your hardware usage, it's not necessarily web boxes, but it's boxes in the background broadcasting the events out to the world, and the datastores to hold it all. Depending on how your web code works you could also have a lot of overhead on the webservers putting all that stuff together.
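For anyone who wants the two approaches side by side, here is a minimal in-memory sketch. Everything in it (the dict names, `follow`, `publish`, `feed_by_join`, `feed_denormalized`) is hypothetical and stands in for real datastores; it is not how Digg or toolbox.com actually did it, just the shape of the trade-off.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the real datastores.
followers = defaultdict(set)         # author -> users following that author
following = defaultdict(set)         # user -> authors that user follows
posts_by_author = defaultdict(list)  # author -> list of (timestamp, post_id)
feeds = defaultdict(list)            # user -> precomputed (timestamp, post_id) pointers


def follow(user, author):
    following[user].add(author)
    followers[author].add(user)


def publish(author, timestamp, post_id):
    """Store the post once, then copy a pointer into every follower's feed."""
    posts_by_author[author].append((timestamp, post_id))
    for user in followers[author]:  # this loop is the background "broadcast" work
        feeds[user].append((timestamp, post_id))


def feed_by_join(user, since):
    """Approach 1: build the feed at read time by joining network against content."""
    items = [
        (ts, post_id)
        for author in following[user]
        for ts, post_id in posts_by_author[author]
        if ts >= since
    ]
    return sorted(items, reverse=True)


def feed_denormalized(user, since):
    """Approach 2: read the per-user feed that publish() already filled in."""
    return sorted((item for item in feeds[user] if item[0] >= since), reverse=True)


follow("alice", "bob")
follow("alice", "carol")
publish("bob", 1, "b1")
publish("carol", 2, "c1")
publish("bob", 3, "b2")
print(feed_by_join("alice", since=2))
print(feed_denormalized("alice", since=2))
```

The join version does work proportional to everything your network has ever posted on every read; the denormalized version does work proportional to your follower count on every write, and stores one pointer per follower per post. Note that in this sketch posts published before a follow never reach the denormalized feed, which is one of the many details a real system has to decide on.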

My experience here comes from building the social features into toolbox.com. A good example is this page http://it.toolbox.com/people/george_krautzel/posts-connectio... That's all the posts from users connected to our CEO (all 750k of them). Getting that to return in near real time is super fun (and you can probably tell that I went down the DB join path before it all fell apart).


Where does the "500 servers" number come from?

Are they compute nodes? Are they webservers(5)? Webservers and appservers(10)? Web, app, database(12)? Web, app, database, back-end processing(15)? How much redundancy is built in(30)? Staging environments(60)? Development environments(120)? Standard IT infrastructure like DNS and email(125)?

"We run over 500 servers" sounds a lot like "we have 500 servers in our datacenters" not "it takes 500 servers to handle 200MM page views."
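Reading the parenthesized numbers above as running totals (an assumption; the comment doesn't say so explicitly), a quick sketch shows how little of such a fleet would be serving production traffic. The figures are the hypothetical ones from the comment, not Digg's real numbers.

```python
# Running totals as given in the comment above (hypothetical figures).
running_totals = [
    ("webservers", 5),
    ("appservers", 10),
    ("database", 12),
    ("back-end processing", 15),
    ("redundancy", 30),
    ("staging environments", 60),
    ("development environments", 120),
    ("standard IT infrastructure", 125),
]


def increments(totals):
    """Turn (label, running_total) pairs into (label, servers_added) pairs."""
    previous = 0
    result = []
    for label, total in totals:
        result.append((label, total - previous))
        previous = total
    return result


for label, added in increments(running_totals):
    print(f"{label}: +{added}")
```

In this made-up breakdown only the first 15 of 125 machines are the ones directly handling production requests; the rest is redundancy, other environments, and general IT.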


I used to work at digg as an engineer. I won't say your distribution numbers are correct, but you've got the gist. There are multiple environments, reserved capacity, research and testing machines, etc.


Pretty sure those numbers discount serving up Digg information out to every blog that includes a Digg button as well. Pretty sure if SO added something like it, they'd be running on more than 5 servers.


My last job served around 30-40 million pages a month on 4 servers that used ASP.Net and SQL Server 2000. The only server whose CPU utilization was over 20% was the database server. I'm not sure the CPU utilization on the web servers ever went above 10%.

A friend of mine's company has a tough time keeping up with 500,000 page views a month using more equipment. They're using PHP and MySQL.


Apples and oranges comparison really.

- What do the sites do?

- How optimised are they?

- How much static/cachable content is there?

Really there is too little data for this comparison.


Both are simple sites by any definition. The database queries for almost all the page views are clustered index seeks (on SQL Server) and in MySQL they are mostly looking up on the primary keys for the tables used (these are MyISAM tables so no clustered indexes here).

Both are equally unoptimized outside of the database queries which I believe are fairly well optimized.

No real cacheable content on either outside of the usual suspects (JS, CSS, some images).


And Michael Schumacher in a compact can beat you in a sports car.


Not in a drag race where he's driving a Yugo and you're driving a Ford GT. Unless you crash of course.


What types of servers in terms of CPU cores and RAM? Microsoft's licensing structure makes it much cheaper to scale up instead of scaling out.


I serve around 50 million requests a day using 4 servers on a system built using PHP and MySQL on one of my services.

Requests/pages per server is a bad metric for determining efficiency when every person's problems are so different.

If your utilization is so low why do you have the additional servers?


This is meaningless without the necessary information to evaluate the situations.

Apples and orangoutangs.


So, do you get free coffee in the cantina at Microsoft?


I'm not sure what you are suggesting?

Free coffee would be enough to get me to say something nice about Microsoft products?

I work at Microsoft?


Why does it take 20 feet of land to grow a gallon of orange juice and 200 feet of land to grow a gallon of apple juice?


A site that posts links, has voting and comments vs a site that posts questions, has voting and comments is hardly apples vs oranges.


One thing is for sure, Spolsky sure knows how to scale his ego.


I'm not sure what that's supposed to mean.


Maybe Digg uses small servers whereas StackOverflow uses big servers?


Using big servers would be what Microsoft's licensing model makes you want to do.


Non-linearity.


Even then, 500 servers? That's an incredible amount.

EDIT: Then again Facebook is using 60000 servers at the moment.


Facebook has a lot more than 100 times digg's registered users, though.


Not to mention a basically non-existent cachability profile, and an order of magnitude more functionality.


I'm assuming you mean Facebook has more functionality and no cacheability. That might be true, but I wonder how expensive getting a Digg discussion page (with 100-200 posts) is compared to, say, a 5-answer SO question or even just a simple Facebook profile.


As I understand it, a Facebook profile is anything but simple.


I mentioned this to Spolsky already but depending how you want to count servers (web servers or all front facing servers or all servers in your infrastructure?), we have 10-20 servers servicing 350 million page views per month. That is of course not counting all the ajax requests that we also service, which don't count as page views. So we service 17-35 million page views per server, compared to StackExchange's 12 million page views per server.
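For anyone checking the arithmetic, the per-server range is just the monthly total divided by the low and high server counts. A small sketch using only the figures quoted above (the function name is made up for illustration):

```python
def page_views_per_server(monthly_page_views: float, servers: int) -> float:
    """Average monthly page views handled by each server."""
    return monthly_page_views / servers


MONTHLY_PAGE_VIEWS = 350_000_000  # figure quoted in the comment above

for servers in (10, 20):
    per_server = page_views_per_server(MONTHLY_PAGE_VIEWS, servers)
    print(f"{servers} servers: {per_server / 1e6:.1f} million page views per server per month")
```

That gives 35.0 million with 10 servers and 17.5 million with 20, which is where the 17-35 million range comes from. As noted, it leaves out ajax requests, so it understates the actual request load per box.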

We use PHP and Apache, so the gloating about ASP being way more efficient than PHP seems to be unfounded.


Digg must be pluggr's biggest customer... http://pluggr.info/


Where does the "500" come from?


Kevin mentioned it somewhere a few weeks ago. I forget where though.


bad server administrators? no optimization? no cache? static content served from apache servers instead of, for example lighttpd?

there could be many reasons


What's the cost per page view in each case?


Depends on the servers too; I'm sure it would take more than 5 of the servers Google runs to run SO.


Digg is using Eee PCs as servers?


[Sunglasses On] Sounds like the .Net stack performs much better than the LAMP stack.


I thought people on HN are more open to technology discussion. It seems people can't handle their bubble being pierced when presented with facts.

It would be more interesting if you had facts to refute my statement rather than downvotes.



