Why did we take reddit down for 71 minutes? (reddit.com)
100 points by icey on Jan 7, 2010 | 44 comments



"Memcachedb also has another feature that blocks all reads while it writes to the disk."

Seriously?

Wow.


Quite seriously.

I use memcachedb in my deployment of A/Bingo (it isn't limited to that), which avoids using a SQL database because I wanted it to be blazing fast for people who have significantly higher-traffic sites than I do. The typical-case performance on my site is fantastic. The worst-case performance is an abomination -- when it flushes, the site essentially pauses for all users for five seconds.

This is tolerable for me, but only just. The reason I use memcachedb rather than vanilla memcached is that I was worried about persistence in the event of a server failure, but given that I can count resets of my server over 3.5 years on one hand, I might just decide "In the event of a server failure, I lose any A/B tests in progress and have to start over. Oh well!"


Sounds like it's time to look at Redis...


Or Tokyo Cabinet / Tyrant, which is still a high performance key/value store but doesn't need to fit everything in RAM. Depends on how much they're storing.


If the data isn't "optional" (i.e. on a cache miss you can't just go to a traditional database at a somewhat higher - but acceptable - cost), the "memcached with persistence" approach isn't going to cut it. You now have a distributed system with state, which is a much more difficult problem.

That's why there are so many distributed storage systems: there's no one-size-fits-all solution that can handle every theoretical corner case (even Google hasn't solved that). In my (biased) view, the eventually consistent stores (e.g. Dynamo-inspired ones, or even the "Friendfeed/Facebook model" of sharded and replicated MySQL databases storing blobs) seem the most reasonable for web-type problems.
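
To make the "optional" case concrete, here's a minimal cache-aside sketch, assuming the python-memcached client and a hypothetical stories table -- on a miss you simply pay the (acceptable) database cost and repopulate the cache:

    import sqlite3          # stand-in for "a traditional database"
    import memcache         # python-memcached client

    mc = memcache.Client(["127.0.0.1:11211"])
    db = sqlite3.connect("example.db")

    def get_story_score(story_id):
        # Cache-aside: a miss is not fatal, it just costs a DB round trip.
        key = "score:%d" % story_id
        score = mc.get(key)
        if score is None:
            row = db.execute("SELECT score FROM stories WHERE id = ?",
                             (story_id,)).fetchone()
            if row is None:
                return None
            score = row[0]
            mc.set(key, score, time=300)  # repopulate with a 5-minute TTL
        return score

If the data isn't optional, there is no such fallback path and the cache effectively becomes the system of record, which is exactly the distributed-state problem described above.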


Virtual memory support is currently being added to Redis so that it won't need to fit everything in RAM either. I'm sure antirez can provide some better info on how it'll work.


Hello, VM is already in alpha on Git, actually. There is some more work to do, but I don't think the Reddit use case needs Redis VM: they are using MemcacheDB as a persistent cache, and if it's a cache it should match the very high performance delivered by memcached without being required to cache everything. Redis is as fast as or faster than memcached (using clients of comparable performance) and is persistent, so it's probably a good fit for this problem.

Instead of VM, I guess Reddit should use Redis EXPIRE, that is, a time to live on cached keys so they auto-expire.

Btw, for people who don't know what Redis Virtual Memory is: with VM, Redis is able to swap rarely used keys out to disk. This makes a lot of sense when using Redis as a DB. When using Redis as a cache, the way to go is EXPIRE: rarely used things in the cache should simply be expired and go away instead of being moved to disk.

EDIT: It would be very interesting to know where the Reddit performance problem is, but Redis sorted sets are a very good match for building social-news sites like HN or Reddit in a scalable, distributed way, with a few workers processing recent news and updating their scores in the sorted set. The home page can then be generated with ZREVRANGE without any computation.
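
Roughly what that looks like with the redis-py 3.x client (key names and numbers here are made up for illustration, not reddit's actual usage):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def update_score(story_id, score):
        # A worker recomputes a story's score and pushes it into one sorted set.
        r.zadd("hot", {"story:%d" % story_id: score})

    def front_page(n=25):
        # The home page is just the top N members by score (ZREVRANGE),
        # with no ranking computation at request time.
        return r.zrevrange("hot", 0, n - 1, withscores=True)

    def cache_rendered(story_id, html):
        # For plain cache entries, a TTL (EXPIRE/SETEX) lets rarely used
        # keys age out instead of being swapped to disk.
        r.setex("rendered:%d" % story_id, 60, html)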

It's a shame reddit is not sharing how this cache is used.


Well, no; if they're using memcachedb presumably their data doesn't fit in memory on a single machine.

Cassandra would be a better choice.

/cassandra committer, but it would :)


When Redis is used as a persistent cache, clients implementing consistent hashing are a perfect fit for a distributed Redis, since you don't need all that data safety. You can usually lose a node in a disaster without too many problems (as happens with memcached), but in normal conditions you want the cache to be non-volatile.
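
For anyone unfamiliar with the technique, a bare-bones hash ring looks something like this (illustration only; real clients add weights, replication, and actual connection handling):

    import bisect
    import hashlib

    class HashRing(object):
        """Maps keys to nodes; losing a node only remaps that node's keys."""

        def __init__(self, nodes, points_per_node=100):
            self.ring = []              # sorted list of hash points
            self.node_at = {}           # hash point -> node
            for node in nodes:
                for i in range(points_per_node):
                    h = self._hash("%s:%d" % (node, i))
                    self.ring.append(h)
                    self.node_at[h] = node
            self.ring.sort()

        def _hash(self, key):
            return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

        def get_node(self, key):
            # Walk clockwise to the first point at or after the key's hash.
            i = bisect.bisect(self.ring, self._hash(key)) % len(self.ring)
            return self.node_at[self.ring[i]]

    ring = HashRing(["redis-1:6379", "redis-2:6379", "redis-3:6379"])
    print(ring.get_node("user:1234"))   # the same key always lands on the same node

That's also why losing one node only invalidates that node's slice of the keyspace rather than reshuffling everything.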


It sounded like they want to use it as more of a "real" database than a cache, since rebuilding data in case of a hardware failure is so painful. <shrug>


We actually looked at Cassandra and found it to be slower than memcachedb. However, we readily admit that we probably configured it wrong.


Hard to say, but we increased speed pretty much across the board by about 50% from 0.4 to 0.4.2, by another 50% (compounded :) in 0.5, and we're already looking at 100% for the release after 0.5... and we're ready to help with configuration on IRC :)

(Also, I'm not sure when you were looking at it, but bootstrap -- adding nodes without any downtime -- is done now.)


I'm bothered by the need to run SW RAID on top of HW RAID. One would think that Amazon would sell faster EBS "disks" for a premium.

And slower disks for a discount? But, I guess that's what S3 is for.


I think it works like this:

Even if Amazon uses 'k' hdds for one EBS "disk", since you're sharing the real hdds with other users, you don't get 'k' hdds' performance, you only get a fraction.

By RAIDing over 'n' EBS "disks", you are effectively compensating for the reduced performance due to sharing.


I get what the stack looks like, but it seems really broken and likely quite inefficient. Thus far, Amazon has gone after greenfield applications which can be written within the constraints of their cloud platform. However, there are a ton of people hosting their own SQL database-based apps where a single DB is the bottleneck. Without significant refactoring, these apps can only scale vertically with the DB. So, while Amazon provides nice, big boxes to run SQLServer/MySQL/etc., disk performance is that of a desktop machine -- hardly a balanced system. How many more customers could they capture if they offered premium, high-performance storage options?


You hit the nail on the head as to why I'm looking into a physical DB server with RAIDed SSDs instead of hopping onto EC2. I would love to use Amazon and not have to deal with the potential headaches of managing physical machines, but the stories (maybe FUD) of having to RAID EBS volumes, spool up 20 instances to find the winners and kill the rest, etc., really kill the appeal.

If they could promise me consistent database performance on par with a really nice physical machine, I would gladly fork over 500/month for it.


As someone who has spent the last year and a half running a 200-(persistent)node environment on EC2, including multiple m1.large and m1.xlarge DB pools, I can assure you those stories stem from FUD and unreasonable expectations.

Yes, EBS is not very fast, especially compared to local disk. You can work around this, however, by configuring multiple volumes in a RAID configuration (as you have mentioned), or by scaling out with additional nodes. The size and workload of your database will dictate which is more cost-effective.

Spooling up many nodes to find the "best" one is completely unnecessary. In my experience, EC2 nodes have been remarkably consistent in performance. I won't say I run a load test on every one, but I will say that over 100,000 node launches, I've never had to shut down a poorly-performing instance that couldn't be attributed to a hardware issue (rare, and for which Amazon sends notifications).

Don't listen to the naysayers. Come on in, the water's fine!


Thanks very much for the FUD-debunk, it's always great to get advice from someone who has thoroughly kicked the tires of something. I may start considering it once again.

Would you mind sharing what kind of small-block IO/sec numbers you've seen from EBS volumes? My app tends to generate lots of IO that isn't very cacheable, and it has a relatively small dataset, which is why I'm considering SSDs in the first place.


An EBS volume has the performance of a ~10-disk RAID; it's hardly desktop class. It would be nice if they offered wide-striped volumes, but Amazon's strategy is to not do anything that customers can kludge for themselves.


This guy's blog posting suggests otherwise:

"Remember, the speed and efficiency of the single EBS device is roughly comparable to a modern SATA or SCSI drive."

http://af-design.com/blog/2009/02/27/amazon-ec2-disk-perform...

Perhaps EBS has improved drastically over the past year?


I really like 'behind the scenes' stories like this from big sites like Reddit, Facebook, Heroku, etc. I work on a smaller scale of a few EC2 instances at a time, but I really enjoy the scaling info.


Does anyone know if reddit is doing anything special to create the RAID? Or is it just mdadm?


Just mdadm.
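
For the curious, that boils down to something like the following, here wrapped in Python for illustration. The RAID level, device count, and device names are placeholders, not reddit's actual setup:

    import subprocess

    # Roughly: mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdf ... /dev/sdi
    # Striping a handful of attached EBS volumes into one md device.
    subprocess.check_call([
        "mdadm", "--create", "/dev/md0",
        "--level=0", "--raid-devices=4",
        "/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi",
    ])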


One major reason not to hop on the cloud bandwagon just yet is issues like these. The more layers underneath that are not under your control the more layers you'll have to add to remedy that.

Systems with excessive complexity are hard to debug, especially when it comes to analyzing performance issues.

Given complete control of the hardware from the ground up, it can already be quite hard to accurately pinpoint a bottleneck so you can solve it. Adding a lot of stuff between your code and the hardware is not going to make that any easier.

Typically a stack has 6 layers before you get to your application: drive, controller, driver, filesystem, database, app.

In a cloud environment, anything under the filesystem layer is effectively out of your control and out of your ability to troubleshoot. The solution -- adding another layer of complexity to combat the slowdown -- is really the opposite of what an ideal cloud environment would give you.

After all, the #1 selling point of the cloud is scalability and performance.

I think it would be best if Amazon worked together with the OP to resolve the issue as a problem ticket rather than trying to solve it by adding software RAID.

Of course, that's just armchair reasoning, not being in the hot seat makes life easier.


After all, the #1 selling point of the cloud is scalability and performance.

I think you're missing the key component that it's scalability and performance that you don't have to manage yourself. And, yes, when you don't manage it yourself, it's going to be much more difficult to diagnose performance problems. But someone else managing the infrastructure also means that people who would be incapable of diagnosing the problem anyway (either from lack of expertise or lack of time) have access to the resources.


It definitely looks as though they ended up having to manage it themselves.

Basically this seems to put a fairly low upper limit to using 'the cloud' for something a little larger before you get back exactly the same kinds of issues that you were dealing with when using self-hosted hardware, only at a higher price point.


It definitely looks as though they ended up having to manage it themselves.

They're not "manag[ing] it themselves" -- it's a gigantic stretch to say that software RAID on an instantly-configured EBS volume is remotely equivalent to specifying, ordering, building, installing and maintaining a number of RAID arrays in a data center.

Basically this seems to put a fairly low upper limit to using 'the cloud' for something a little larger before you get back exactly the same kinds of issues that you were dealing with when using self-hosted hardware, only at a higher price point.

This is ridiculous. The primary issues with self-hosted hardware:

- Energy usage. Data centers only have so much power and cooling to spare, what's available costs money, and you constantly have to monitor/juggle utilization and search for more energy-efficient hardware, which leads to the next issue ...

- Physical plant. You often need to buy it well ahead of time to ensure that you'll have power and space available for expansion should you need it. You need to wire that space, buy switches and routers, and that leads to the next issue ...

- People. You'll need people to order, build, and install those servers. You'll need people to swap out dead components in those servers. You'll need trained people to install PDUs and console management systems, to configure and run your switches and routers.

EC2 and similar services push these issues upstream. You can manage your servers in software. You don't have any routers or switches to manage, and you don't have to order, build, or install hardware. You don't have to expend significant capital outlay on servers or rack space.


I probably am not getting my point across very well.

Pushing issues upstream does not make them go away; it makes them someone else's problem. If that someone else doesn't take care of those issues, then you end up having to solve them yourself.

And I've played enough with software RAID to know that configuring it to perform well is not a walk in the park; in fact, I think it may be harder than getting a good hardware RAID solution up and running on a dedicated box.

The nice thing of course of an EC2 setup is that once you've figured out how to do it you can do it again without much trouble.

As for energy usage and the physical plant, that is entirely up to how you use your servers. For instance, I try to balance the quality of the service against the load on the servers, gracefully degrading the service when the load is at its peak (which is only a few hours per day anyway). That way I maximize my flat-rate payments on bandwidth and server lease at a relatively small fraction of what it would cost me to get similar performance out of the various cloud suppliers' offerings. In a cloud environment those servers would be using just as much power and AC as they do today.

It takes me a little longer to get a server provisioned, on the order of 2 to 3 days, but it is 2 to 3 days whether I order one, 10, or 100 servers. Very few businesses would ever need to grow faster than that.

Maybe my business is a lucky one in that it can make optimal use of a dedicated server setup, but I see plenty of people choosing a cloud-based solution when, if you run the numbers, it makes very little sense.

The cloud comes into its own if you have wildly fluctuating loads and/or jobs that need large numbers of machines for a relatively short period.

But for the majority of longer term high bandwidth uses I can't make the numbers work at all.


I'm curious about the following:

- What scale you're operating at.

- What vendor you've found that can provide you with 100 leased servers in 2-3 days while costing less than a cloud provider and not requiring you to maintain your own routing/switch/etc infrastructure.

- How you see a dedicated vendor providing managed leased server hosting, network services, hands-on management, etc., as genuinely different from a cloud provider -- other than requiring a significantly longer turnaround on provisioning and management tasks.

- How expensive (and for what length) the lease terms are on that server hardware. I've yet to find a quality managed hosting provider that will lease hardware at terms that come close to matching the pricing of either in-house-maintained or EC2-provisioned servers.


- Several gigabits of dedicated bandwidth.

- http://leaseweb.com

- Probably they are not very different from a cloud provider in that respect, other than that I seem to get pretty good treatment from hosting providers in general (EV1/The Planet excepted; they've gone downhill to the point that we quit hosting there).

There are differences between providers, but for the most part those are relatively small once you reach a certain level.

- The lease terms vary depending on the use case, but the majority of the longer-term leases are for one to two years; the flat-rate published price is about 1E29 / Mbit exclusive of VAT for 100Mbit, including server lease. I get a better deal than that, but I've been asked not to publish it; I'm sure you understand.

A box with 24 1TB drives, 8GB of RAM, and a dedicated 1G flat-rate uplink currently lists at E1199/month ex VAT if you pay for a year in one go, a bit more if you pay month-by-month.

http://www.leaseweb.com/en/configurator/index/id/95

If you're located outside the EU then you do not pay VAT (and if you are in the EU you'll get it back).

Good negotiators will probably be able to shave some off that price, and if you are able to serve lots of bandwidth with relatively little cpu you can add another G for E750, which puts you under 1 euro / Mbit.
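
Just to spell out that last figure using the list prices quoted above (E1199/month for the box with its first gigabit, E750 for a second):

    box_with_1g = 1199.0   # EUR/month, server lease plus 1 Gbit flat-rate uplink
    second_gbit = 750.0    # EUR/month for an extra 1 Gbit
    total_mbit  = 2 * 1000.0

    print((box_with_1g + second_gbit) / total_mbit)   # ~0.97 EUR per Mbit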

If I did the same using Amazon, I'd be paying a multiple.


If they had a dedicated server stack, they would probably have done the same thing as a hardware RAID anyway. They just replicated in software what they would have done in hardware.


Actually, we did work with Amazon. The RAID was their suggestion.


Reddit went on the cloud bandwagon not because they thought it was superior to managed servers, but because Conde Nast's IT department sucked. With the ongoing growth, they were having trouble procuring additional servers as needed, and moving to Amazon solved that problem.


That's not entirely true. Yes, it is true that getting servers was hard, and that was definitely a factor.

But the bigger factor for me was that I was tired of having to build, image and rack all those servers. I liked the flexibility of EC2, and also not having to waste resources ordering a full rack's worth of hardware every time.

Cost was also an issue. Datacenter space in SF is expensive, but it had to be in SF, because that is where I was. EC2 proved to be much cheaper than physical servers.

I also like the fact that I don't have to run to the datacenter anymore when there is an issue. I just file a ticket with Amazon.


OK, thanks for the clarification. I'm trying to find the post I read that gave me that idea but can't find it. Was it in your AMA?


Could have been. Or possibly something Spez said.


I thought it could have been a spez or kn0thing post but I went through all of them and couldn't find anything. I guess it could have been deleted, but I'm going to chalk this up to faulty memory on my part; you were there.


Can you give us any reference for the "Condé Nast's IT department's fault" theory? To the best of my knowledge, reddit's infrastructure was always maintained by reddit staff and Condé Nast had no influence on that.

The current reddit staff did an "Ask Me Anything" thread on reddit when the founders left, and they said the reason for the move to AWS was purely price/scaling issues, and that part of reddit was already using AWS even before Condé Nast bought them.


Reddit is down super-often for me. This very minute I can't log in or use the site while logged in; it just serves me 503 errors.


sounds like a redefinition of 'in any way' to me...

So why am I singing the praises of Amazon and EC2? Mainly to dispel the opinion that the site getting slower since the move is in any way related to Amazon...(snipped)...Unfortunately, the single EBS volumes they were on could not handle these bursting writes.


This isn't an EC2 issue, this is a SAN issue. Whether it is EBS or an NFS drive, meh. This is an architecture issue, and while it is a result of the underlying hardware, the underlying hardware is not the constraint.


Agreed, it's largely an architecture issue; however, poor EBS performance is a contributing factor and he seems to go out of his way to say that it's not...


He went out of his way because most reddit users keep blaming AWS for the recent issues; as jedberg (reddit IT guy) recently mentioned, the problem is not with AWS scaling but with reddit's software scaling.


Blaming poor EBS performance would be like blaming Intel for their 4GHz processor not being able to do your protein folding in 5 seconds.

It is simply a known limitation that has to be worked around.

Even if we owned the servers, we would have the same limitation -- eventually you just can't get the performance out of a single disk.



