Surviving AWS Failures with a Node.js and MongoDB Stack (kinvey.com)
34 points by djensen47 on Oct 24, 2012 | 28 comments



Here's how I do it:

  $ pip install dotcloud
  $ echo 'frontend: {"type": "nodejs"}' >> dotcloud.yml
  $ echo 'db: {"type": "mongodb"}' >> dotcloud.yml
  $ dotcloud push $MYAPP
  $ dotcloud scale $MYAPP frontend=3 db=3
This will deploy my nodejs app across 3 AZs and set up load-balancing to them, deploy a Mongo replica set across 3 AZs, set up authentication, and inject connection strings into the app's environment. It's also way cheaper than AWS.
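
Roughly, the app side just reads the injected credentials. Here's a minimal sketch with the Node.js MongoDB driver; dotCloud drops service credentials into /home/dotcloud/environment.json, and the key name below is an assumption based on calling the service "db" in dotcloud.yml:

  // Sketch: read dotCloud's injected credentials and connect to the replica set.
  // The key name assumes the MongoDB service is called "db" in dotcloud.yml.
  var fs = require('fs');
  var MongoClient = require('mongodb').MongoClient;

  var env = JSON.parse(fs.readFileSync('/home/dotcloud/environment.json', 'utf8'));

  MongoClient.connect(env.DOTCLOUD_DB_MONGODB_URL, function (err, db) {
    if (err) throw err;
    // hypothetical collection name, just to prove the connection works
    db.collection('pings').insert({ at: new Date() }, function (err) {
      if (err) throw err;
      console.log('connected and wrote a document');
      db.close();
    });
  });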

The only difference with OP's setup is that the Mongo ports are publicly accessible. This means authentication is the only thing standing between you and an attacker (and maybe the need to find your particular tcp port among a couple million others in dotCloud's pool).
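
To be clear, turning that authentication on is cheap. A rough sketch from the mongo shell (the user name and password are placeholders, and a replica set with auth also needs a shared keyFile so members can authenticate to each other):

  // Run against the primary in the mongo shell. addUser is the 2.2-era API
  // (newer servers use createUser); the credentials here are placeholders.
  var admin = db.getSiblingDB('admin');
  admin.addUser('admin', 'use-a-long-random-password');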

(disclaimer, I work at dotCloud)


"It's also way cheaper than AWS."

3 AWS Small instances cost under $200 / mo and come with 1.7GB of RAM each.

The dotCloud pricing calculator is coming up with $700 / mo for 3 mongodb instances with 1.7GB of RAM.

Obviously this isn't an apples-to-apples comparison. But how are dotCloud instances different from AWS instances?


It's cheaper at an equivalent level of best practice:

* For a clean architecture you want to isolate each Mongo and node process in its own system. So you need 6 instances, not 3.

* You'll need load-balancers in front of these node instances. That costs extra on AWS, and is included on dotCloud.

* Did you include the cost of bandwidth and disk IO in your estimate? Those are extra on AWS, but included on dotCloud.

* Monitoring is extra on AWS. It's included on dotCloud.

* I love to have a sandbox version of my entire stack, with the exact same setup but separate from production. That's an extra 2 instances on AWS (+io +bandwidth +load-balancing +monitoring). It's free on dotCloud, and I can create unlimited numbers of sandboxes which is killer for team development: 1 sandbox per developer!

* We only charge for ram usable by your application and database. AWS charges for server memory - including the overhead of the system and the various daemons you'll need to run.

* For small apps specifically, you can allocate memory in much smaller increments on dotCloud, which means you can start at a lower price-point: the smallest increment is 32MB.

I didn't even get into the real value-add of dotCloud: all the work you won't have to do, including security upgrades, centralized log collection, waking up at 4am to check on broken EBS volumes, and dealing with AWS support (which is truly the most horrible support in the world, and we pay them a lot of money).

+ Our support team is awesome and might even fix a bug in your own code if you're lucky :)


My recommendation is to validate "best practice" claims, no matter which hosting solution you use. Measure, measure, measure to make sure you're not only getting the claimed behavior but also that the end result meets your expectations. For example, in the past I had 7 "instances" (as shykes points out, make sure they are hosted on separate nodes!), 4 of which were load-balanced Python web app instances. One of the nodes was overloaded, so 1 out of 4 requests was very slow (5-10x). This was a big ajax app, so initial page load would hang on the request(s) to that one instance. My point is that because I had measured, I could see that the node was the problem; now that I am on dedicated EC2, each node is consistent. Good luck.


That's good advice. As the saying goes: "trust, but verify".

Regarding your performance issue - most platforms (including dotCloud) enforce ram and cpu separation between nodes, but are vulnerable to IO contention at some level. This is also true for EC2 if you use EBS: your standalone instances will almost certainly, at some point, suffer from degraded and inconsistent performance because another instance is competing for IOPS [1].

You can avoid this with the new "provisioned IOPS" volumes [2], or by skipping EBS altogether for stateless components.

[1] http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-i...

[2] http://aws.amazon.com/about-aws/whats-new/2012/07/31/announc...
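
If you go the provisioned IOPS route, it's a single API call. A rough sketch, assuming the aws-sdk for Node.js (the region, zone, size, and 1000 IOPS figure are placeholders; the equivalent ec2-create-volume command works too):

  // Sketch: create a provisioned-IOPS ("io1") EBS volume with the aws-sdk.
  // Assumes AWS credentials are already configured in the environment;
  // zone, size, and IOPS values are placeholders.
  var AWS = require('aws-sdk');
  var ec2 = new AWS.EC2({ region: 'us-east-1' });

  ec2.createVolume({
    AvailabilityZone: 'us-east-1a',
    Size: 100,            // GiB
    VolumeType: 'io1',    // provisioned IOPS volume type
    Iops: 1000
  }, function (err, data) {
    if (err) throw err;
    console.log('created volume', data.VolumeId);
  });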


This seems ridiculously nice to me, but I haven't heard much buzz around this in the community. Any thoughts on why there's comparatively less noise about dotCloud's offering? It seems far nicer in many ways than others.

From what I remember when trying to get the copy/paste instructions written for Kandan so others could deploy to dotCloud to get started with it, there was a steeper learning curve for dotCloud than for other services (with the possible exception of CloudFoundry, which was a sunk cost for us at that point anyway...). Maybe the on-boarding is a bit tough for new users? Where do you see the most significant drop-off in your funnel? If you don't mind sharing, of course.


> This seems ridiculously nice to me, but I haven't heard much buzz around this in the community. Any thoughts on why there's comparatively less noise about dotCloud's offering? It seems far nicer in many ways than others.

I think there are 2 reasons.

One reason is simply that we're better at building the product than the buzz. As it turns out, "developer buzz" is not organic; it is something that must be engineered, like any other feature of the product. There are people who specialize in crafting and projecting an image of success in a way that appears authentic. It is a difficult and highly specialized skill, not to mention that it involves a fair amount of "fact distortion" that doesn't appeal to us.

The second reason is that we're successful without it. When you and I say "developer buzz" we usually mean "HN-reading bleeding-edge developer buzz", but 99.999% of our addressable market doesn't read HN. We crave our peers' appreciation and respect as much as everyone else - but at the end of the day, that's not what pays the bills. In our case, middle-aged developers and IT managers looking to remain competitive while circumventing office politics to get their development server in 6 weeks instead of 8 - that's what pays the bills.


Sounds reasonable, very cool. I suspected it was something along those lines - a demographic that had money (and real problems) that I wasn't personally connected to.


I was interested, so I dug into your docs: no PostgreSQL scaling :(


Considering how often EC2 outages are EBS related, we've moved all our servers off of EBS to ephemeral drives. I'm surprised there aren't more people advocating this route.


I think part of the problem is that EBS issues also cause ELB problems, from what I read here on HN. I wouldn't know because we only use us-east for Hadoop.

On the other hand, our Cassandra cluster runs on ephemeral drives and it's way better than EBS even with the guaranteed IOPS thing. Everyone should definitely give this option a try.


Yup. ELB is great except it uses EBS. So part of our migration was to move off of ELBs.


Same here, dotCloud originally used ELB and we eventually moved off, which brought immediate and huge gains in latency and overall reliability.


Add the inability to access the AWS Console to the mix as well.


Reddit refuses to move away from their current infrastructure, despite being held together with little more than string and silly-putty.

According to a dev, they haven't even talked about it. Simply hasn't ever come up.

So Reddit's gonna keep going down like this. Don't be like Reddit.


You need to be in multiple regions to tolerate EC2 outages, not just multiple AZs. Even then, this is only good until AWS's first multi-region failure; this doesn't seem to be an impossible event given EC2's recent track record. Though I can well understand that designing for EC2 region failure is not worth the cost for most systems.


> Even then, this is only good until AWS's first multi-region failure; this doesn't seem to be an impossible event given EC2's recent track record.

Doesn't everything in their track record indicate that regions are nicely partitioned from each other? Even the biggest region failures they've had have stayed completely isolated to that region.


AZs were supposed to be that unit of isolation; then, when multiple AZs failed, that shifted to Regions. It seemed like a "blame the victim" mentality to me.

Given that AWS are running the same software across regions and have the same people & processes in place, and further that there's software that runs across regions (e.g. S3), I'd wager it's not long before we have a multi-region outage.

Finally, some of the multi-AZ problems in the past were compounded because as one AZ went down everyone hammered the other AZs, taking out the APIs at least. That's when everyone believed that AZs were isolated. Now that people know that's not the case, those same systems are going to be hammering across multiple regions.


Perhaps you misread/misinterpreted the level of isolation that AZs provided.

AZs are physically separate data centers. They are protected from fires, flooding, physical disasters. BUT they do share some common components which allow you to do things like shift EIPs between AZs, snapshots, security groups, etc. (Source: http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availabilit...)

Regions, on the other hand, are completely separate installations of every component of the AWS stack. You can verify this because no resources can be shared between Regions (snapshots, groups, EIPs, etc.). You can also verify this during an outage, i.e. when the US-EAST-1 API becomes unresponsive (due to throttling), US-WEST-1/2 are still available.


[deleted]


That FAQ you yourself pointed to doesn't mention that there are common components.

We now know that when Amazon said - in that very FAQ - "even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone" they were very carefully not lying, but implying something that simply isn't true. We didn't know 18 months ago that multiple AZs would fail simultaneously (unless there was e.g. a huge earthquake). I agree that we know that now.

You believe we won't wake up at 3AM one morning to learn of an unanticipated way that multiple regions will fail at the same time. I don't share your faith.

Edit: This in reply to joeyi's comment above. It got double-posted, and I replied to one of the copies at the same time as joeyi deleted it!


I share your paranoia in general (as ops), but I can assure you that regions are very isolated from one another. I know that releases are rolled out on a very long schedule (think quarter-long releases), and that is to prevent what you describe.

I would argue that the application (i.e. the application being hosted on AWS) is probably going to fail before multiple regions do simultaneously, and that should be addressed before thinking about going multi-provider.


Do you work for AWS as well? If so, I'd ask that team AWS spend less time astro-turfing on HN, and more time documenting your systems, so we can assess these risks for ourselves.

For example, I haven't heard of any precautions taken against a thundering herd of clients retrying requests in other regions if us-east goes down. What does AWS have there? How much spare capacity do you run in each region?


Regions are 100% independent of one another, both physically and in terms of the control plane. Also, code pushes to regions for new features never happen on the same day.


Source?

AZs were supposed to be independent; they aren't. Fool me once...


I used to work on the EC2 team. The regions are wholly independent of one another.


I hope you blog more of these practices then. AWS doesn't put this stuff in writing, which is very convenient for them when something goes wrong, but makes it nigh on impossible to build a reliable system on EC2.

I don't think it's an easy problem to solve, but to suggest that the regions won't go down together strikes me as "the Titanic is unsinkable" hubris. I hope the AWS team doesn't share your attitude :-)


There were comments during the failure that AWS wasn't properly switching to use the available zones during the outage. That's what I find troubling. You are paying extra for some guaranteed availability, and everyone keeps saying that's how you prevent downtime during outages. Then when the time comes it doesn't work?


If you were affected by this, I hope you got a big refund (1+ month).



