Amazon’s problem isn’t the outage, it’s the communication (geekwire.com)
135 points by webwright on April 22, 2011 | 30 comments



My company is in the same boat, so I'm definitely sympathetic, but I think the criticism is a bit harsh. First of all, transparency is easy for a startup because no one is listening (the majority of your prospective customers haven't heard of you yet) and you don't have that much to lose anyway. It also pays dividends because early adopters are the people who appreciate that transparency the most. When you're a big corporation the economics change, and lawyers and PR call the shots; it's not because they're disingenuous, it's just because they have a lot more to lose.

The updates from Amazon have been okay in my opinion. They could have been better, sure, but do we want the engineers working on this to stop every 20 minutes to write up obscure details that don't yet paint a cohesive picture? The bottom line is better updates won't help them fix the problem faster, and actually could distract from the resolution.

If they don't post any more details even after everything's back up and running, then there will be something to complain about.


> do we want the engineers working on this to stop every 20 minutes to write up obscure details that don't yet paint a cohesive picture?

In a company the size of Amazon, I'd think they could afford to hire at least one person who had all the knowledge of a programmer, but who did no coding of his/her own—whose sole purpose would instead be to watch over the shoulders of the people doing the programming at times like this, and push-notify management and/or PR with the progress and difficulties in an accessible fashion. A technical stenographer, or code-bard, if you will.


> In a company the size of Amazon, I'd think they could afford to hire at least one person who had all the knowledge of a programmer, but who did no coding of his/her own

You're assuming such a person could actually produce any answers. I think the big problem in situations like this is that internally they actually don't know for sure what is going wrong. You could probably talk to 5 different engineers and get 5 viable theories. They have ideas, good ideas, but they aren't 100% sure; the answers are being determined by experiment. I think having a person running around communicating lots of possibly incorrect ideas about what is going on could easily make things worse.

One big issue, I think, is that Amazon's own status page appears to have been incorrect - it's had a lot of green ticks against things that people have reported as being completely down. There's going to need to be some follow-up on this sub-issue down the track, on top of everything else.

I really don't envy Amazon at all in this situation. In many ways they are doing things with AWS that nobody has ever done before, and they are probably hitting classes of problem that nobody ever anticipated. Of course, they promised more than they could deliver, so it's their fault in the end, but I still think the nature of what they are doing is underappreciated.


You'd be wrong. Even though it is a huge company, budgets are managed in really small groups, and no manager would "waste" a headcount when they only have a handful to begin with.


I've worked at a large FI where there was an entire team devoted solely to this task--"communicate with/between stakeholders and techies when something doesn't work." I have no idea what they do--or are qualified to do--otherwise. They are 10 deep with 3 middle managers and a senior, out of a 100ish-person department devoted to the FI's .com and online services front-ends. There is another team for customer support.

Although definitely overkill, and understandable in an organization that doesn't particularly value efficiency, it goes some way to illustrate the importance of these communication channels in mission-critical environments.


Such a person, though, if kept by anyone, would be budgeted under a high-ranking executive or "internal affairs" group that would want to know what's going on in order to optimize their requests, not under some group wishing to publish updates and thereby be optimized. I'd imagine this person would be constantly reassigned to whichever project was currently experiencing the most emergency-like conditions (if they weren't given free rein to find those projects themselves, since they're in the absolute best position to hear the word on the street about new emergencies).


Have you ever tried to fix a problem with half a dozen people staring over your shoulder? It doesn't help.


I have, and you're right, it doesn't seem to help. But funnily enough, I also found that with so many people watching over my shoulder I was forced to think cleanly and logically about the process I was going through to diagnose the faults and bring the systems back online. Since I was explaining each step of the process to a group of shoulder-surfing engineers and managers, I gained a clarity I may not have had if left to my own devices. It's like the "Inflatable Engineer" scenario, except I had several real ones available instead :-)


"It also pays dividends because early adopters are the people who appreciate that transparency the most. When you're a big corporation the economics change, and lawyers and PR call the shots; it's not because they're disingenuous, it's just because they have a lot more to lose."

Ultimately I think "legacy" PR based on control of the message is obsolete.


I wonder if everyone here is using the same definition of "working on"?

Perhaps if the engineers gave each other updates every 20 minutes, the coherent picture would emerge more quickly?

Hopefully they aren't doing the same "jiggle it until it works" that is being discussed elsewhere on HN right now; so, what are they doing, precisely?

I know we don't know because they aren't saying, but what are people who work in teams on huge important systems doing generally when they fail? What should they be doing?


Since Amazon is large and established enough to take public relations and marketing seriously, it's reasonable to expect them to provide timely communication on a large outage like this one. I think they failed in this, and some people at Forrester Research seem to agree with me.

http://blogs.forrester.com/tim_harmon/11-04-22-good_proactiv...

There are also a few informative links at the bottom of that note that I haven't seen here.


The worst thing for any service provider is to delay the initial confirmation that yes, something seems to be going wrong.

> http://blog.dotcloud.com/working-around-the-ec2-outage

"After one hour, Amazon's Health Dashboard was still pretending that everything was right."

There really is no negative side for someone as large as Amazon to immediately put up a quick notice that "we are receiving complaints about x-y-z and looking into it." I get pinged like crazy within minutes of one of my clients' servers slowing down or going down; I can't imagine that Amazon doesn't know something is wrong within seconds. If it turns out that it wasn't AWS but something else (Level 3 pipe issues or Comcast DNS errors), then just clarify that later. Why make every single AWS customer panic for an hour fearing that the fault lies within their own code/services?
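
For what it's worth, the kind of dumb external probe I'm describing takes minutes to write. A rough sketch (the URL, thresholds, and alert hook are placeholders, not anybody's real setup):

    # Poll an endpoint and complain the moment it gets slow or stops answering.
    # Purely illustrative: URL, thresholds and alert channel are made up.
    import time
    import urllib2

    URL = "http://example.com/healthcheck"    # hypothetical endpoint
    TIMEOUT = 5                                # seconds before we call it dead

    def alert(msg):
        # Swap in email/SMS/pager here.
        print "ALERT: %s" % msg

    while True:
        start = time.time()
        try:
            urllib2.urlopen(URL, timeout=TIMEOUT)
            elapsed = time.time() - start
            if elapsed > 2:
                alert("healthcheck slow: %.1fs" % elapsed)
        except Exception, e:
            alert("healthcheck failed: %s" % e)
        time.sleep(30)

Amazon's internal monitoring is presumably vastly more sophisticated than that, which is exactly the point: they almost certainly knew long before the dashboard admitted anything.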


There really is no negative side for someone as large as Amazon to immediately put up a quick notice that "we are receiving complaints about x-y-z

Do you have any idea how many such complaints Amazon receives on a normal day? Per hour?

Why make every single AWS customer panic for an hour

Diagnosing problems in a big system is not that easy.

A turnaround time of an hour is not too bad for a behemoth the size of Amazon, especially when you consider that this was a worst-case scenario.


Even with a large service, you can immediately identify a fluctuation in the number of complaints (in addition to signals from monitoring tools). Speaking from experience. In fact, the larger the service is, the easier it is to statistically identify an uptick in the number of complaints per smaller time interval.
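
To be concrete, the statistics involved don't need to be fancy. Even something as crude as comparing the current interval's complaint count against a recent baseline works at scale; a rough sketch (the counts, interval size and 3-sigma threshold are all made up):

    # Flag an interval whose complaint count is far above the recent baseline.
    def complaint_spike(history, current, sigmas=3.0):
        # history: complaint counts for the last N 5-minute intervals
        n = float(len(history))
        mean = sum(history) / n
        var = sum((x - mean) ** 2 for x in history) / n
        std = max(var ** 0.5, 1.0)   # floor so quiet periods don't over-trigger
        return current > mean + sigmas * std

    history = [4, 6, 5, 7, 3, 5, 6, 4, 5, 6]   # a normal stretch of the day
    print complaint_spike(history, 5)           # False
    print complaint_spike(history, 60)          # True -- time to say something publicly

The bigger the service, the more stable that baseline is, so a real spike stands out that much sooner.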


Even with a large service, you can immediately identify a fluctuation in the number of complaints

Immediately is a relative term. I would say "1 hour" is pretty much as immediately as it gets on these scales.

I wouldn't be surprised if a significant complaint fluctuation only manifested long after Amazon discovered the problem in their own monitoring.

statistically identify an uptick in the number of complaints per smaller time interval

Yes, but volume is not everything. You also have to qualify (triage) the input, get engineers on the case, confirm the issue, and perhaps get clearance for a public announcement. All the while many of the key people are busy either trying to figure out what is going on, trying to dispatch information to the right people, or just running around waving their arms furiously.


The claim in the title is probably the most surprising thing I've learned running a hosting business. I think most people would prefer that I take the time to email everyone, even if this means I take time away from fixing the problem and the outage lasts another 20 minutes.

To me, this is shocking. My PFY, who is even more nerdy than I am, still doesn't believe that an incomplete message (which is to say, "I see X problem but I haven't figured out how to fix it. I'm working on it") has any value at all.

This is something that I think is difficult for nerds to learn, because it doesn't make sense to us. But it's true. Most people, even fairly technical people, are willing to accept slightly increased downtime for better status updates, even when those updates largely consist of "I'm working on it."


I don't know about your customers but ours want both excellent communication and for the problem to be fixed immediately. This means having more staff during an outage.


Yes, of course. My customers, though, don't want to pay very much, and that's another tradeoff. I think I'm pretty up front about what I'm selling and what tradeoffs I make, and for some people, this is a good tradeoff. There are other services that have made different tradeoffs, that work out better for different people with different needs.

I'm just a little surprised about my customers' preferences in this particular tradeoff.


As a reductio ad absurdum, suppose that Amazon was secretive beyond all dreams of a CIA director, but never ever failed. Would people care about their communications strategy?

Likewise, if they could telepathically beam status updates during a pattern of approximately half hourly failures, would they be feted or avoided?

The updates probably have a lawyerly feel because when the engineers are asked for a status update by management, they get a "piss off, we're working on it, don't bug me". Indeed a smart manager would refrain from indulging the instinct to constantly interrupt work to request status updates. Perhaps Amazon has such managers.


> As a reductio ad absurdum, suppose that Amazon was secretive beyond all dreams of a CIA director, but never ever failed. Would people care about their communications strategy?

This is not the same. No one cares about active communication when things go as expected. However, failures and unexpected outcomes are supposed to be communicated properly.

> The updates probably have a lawyerly feel because when the engineers are asked for a status update by management, they get a "piss off, we're working on it, don't bug me".

The relationship between a boss and an employee is different from that between a customer and a company. Moreover, a boss asking for an update is not analogous to a company providing an update.

When I (an employee, like most of my coworkers) work on a bug fix that makes things inconvenient for others (including my bosses), I try to actively communicate the status of the issue and my progress, even if they don't ask for it, because I believe that's the professional thing to do.


OP's complaint is that he's not getting a ringside seat. There's a difference between asking for minute technical updates (where I made the analogy with interfering management) and getting general, high-level status reports.


I'm sorry but have you ever worked in a large enterprise IT organization?

The JOB of the manager is to stay abreast of the issues his team are working on and provide REGULAR updates as to progress. Even if that means "We are still working on problem X, expected resolution is Y"

I managed IT for an entire division at Lockheed. You think that a company continually up against the onslaught of Chinese hackers employs "smart managers that refrain from indulging ... to receive status updates"?

WTF are you thinking?

The error in logic on your part is that, sure, it may not be best to provide minute-by-minute updates to your customers - but be it known, you MUST provide accurate and regular status internally when you run one of the HIGHEST-VISIBILITY SERVICES on the freaking Internet.

So, that would tell us that there was likely either internal chaos, a group with no idea what the root cause was, OR a failure of the externally facing IT communications channels -- but to say that it is good management to stay effectively out of the loop is false.

Now, I'll give you some credit on one point: you did say "constantly interrupt" -- but if your interactions with your team are such that your inclusion in situational awareness is equivalent to an interruption, then there is something much more wrong with the structure of your team/organisation/ability to manage.


> The JOB of the manager is to stay abreast of the issues his team are working on and provide REGULAR updates as to progress. Even if that means "We are still working on problem X, expected resolution is Y"

This is pretty much what Amazon has provided, and what the OP's link complained about: management-level updates ("we're working on it"), not a line-by-line log of technical hypotheses.

I would hope that in a critical situation they'd realise most customers want the service restored, not to know about why it's broken in exhaustive detail.

edit: removed snark.


I agree - they could have definitely handled customer communication much better.

Until we get more information on the exact details of the EBS failure (it may be available; let me go look... nothing yet), I would guess that some of the following is true:

We know that this was a "network error" that caused "EBS to begin to replicate"

What we don't know is whether this was the result of:

- A bug in an unknown device layer that caused network problems
- A bug in network gear that caused EBS device problems
- A network routing problem due to a fat-fingered config change
- An unforeseen design flaw brought to light via a new spiffy routing change

What does appear obvious is that it is either still unknown/not understood on the part of Amazon -- or it is so severe that we are not being given any information because:

- They have all hands on deck and can't give good updates
- They really don't know how to handle communications
- They have some serious damage control to figure out (this could be an attack or an exploit on their system and they can't let word get out yet)

There are a lot of possible reasons for the poor communication from them, but my bet is just that they had some device/firmware build/configuration explode and they don't quite know how to fix it just yet, or their architecture was so dependent on the particular thing that failed that they have to figure out a really big/hard problem really fast.


    We absolutely love AWS because of the pace of innovation
    and scale that it has allowed us to accomplish. But after 
    today’s episode is over, we will have a big decision
    to make.

    We can spend cycles designing and building technical belts
    and suspenders that will help us avoid a massive failure 
    like this in the future, or we can continue to rely on a 
    single huge partner and also continue our break-neck 
    pace of iteration and product development.
Some could make a case that "Improving Stability" is "Product Development".


Indeed they could; however, one could equally make the case that working on (say) the Photoshop code base is product development, while having to rewrite or work around a faulty NFS or ext3 file system just to get Photoshop to work is an aberration.

"Improving Stability" of AWS is rightly "Product Development" that Amazon should be undertaking, not their clients on an individual ad hoc basis.


Improving the stability __of their own offering__, which they currently host on AWS, is the responsibility of AWS's clients.


Amazon's problem is transparency.

Their technology stack, their failover and redundancy setup: nobody knows it (exactly). Nobody can review the measurements and rate it. Nobody can easily switch to other options if AWS isn't a valid option anymore.

Whatever the AWS marketing says, you're just buying into another SPOF.


While they make the obvious point that communication has been bad, the analysis loses sight of reality at some point. With an outage of this size and duration, clearly the outage is the problem. Better communication might smooth some feathers, but at the end of the day, if a lot of your big customers are down, it won't matter a ton.

I've heard one of the arguments several times: that good early communication would have allowed many customers to move off of the East region early and avoid the bulk of the problem. Given the size of the East region, it's hard for me to imagine they would have had the ready capacity to accept some sort of mass migration. Worst case, you could imagine load/peak-use problems significantly degrading service in other regions as customers moved.


It's worth noting that Heroku's communication (status.heroku.com) has been terrific.



