This is your pilot speaking. Now, about that holding pattern... (googleblog.blogspot.com)
74 points by epi0Bauqu on May 14, 2009 | 33 comments



Honest and classy apology. Nice, Google.


I liked that they admitted that it was an embarrassing mistake. The use of that word resonates with anyone who has ever done anything stupid only to realize it moments after.

For me it's always that same stupid routine operation that I've done 1000x before, so I turn my brain off while I do it. Then there's that moment of shocking clarity where you think "I didn't just... did I?" followed by the mad scramble to check and then the sinking realization that "yes, yes I did." shudder


Yeah. Honesty is nice. Owning up to failures is extra nice.

The Twitter fiasco was interesting to watch: first they removed a feature because it was "confusing" [according to them], then they backtracked and said it was removed because of engineering problems.


Honesty is not nice and neither is owning up to failures. It's expected. It's a sad state of affairs when so many people and companies hold themselves to a lower standard that the ones who do what they're supposed to do somehow earn extra recognition for it.


Sounds like someone fucked up a BGP update. Happens all the time: http://www.google.com/search?q=bgp+update+outage+level3


Is it just me, or does it seem that when you are something like Google, there should be a process to prevent just this sort of thing?


I hate it when someone suggests adding process to fix a problem, and I've seen this quite a few times already.

Most people clearly understand the benefits of adding process, but very few seem to realize the costs.

If I tried hard, I'm pretty sure I could create a checklist with 1000 items for each developer to go through, and no one could argue against any of the items; individually, they would all be reasonable, necessary, and correct. However, if I forced every developer to go through the list every time, for every change, they would rightly feel crushed.

With very few exceptions where a new process is really warranted, I see people trying to substitute process for either thinking or automation. That is a recipe for bureaucracy, and in my view a good part of why working for a BigCo can be so miserable sometimes.

A new process should be a last resort, adopted only after we have answered yes to: a) Is it really beyond us to automate this? b) Is there some flaw in human beings that ensures this mistake will repeat? c) Are the consequences of this mistake really serious?


When I said process, I meant automated process. As a programmer, I consider performing calculable tasks by hand to be out of the question :-P


I agree that process can have a demoralizing effect, but I like to play devil's advocate.

An hour of Google's revenue lost from 14% of their customers costs them about $350,000 (judging roughly by Q1 2009 revenue numbers). Had it been 100% of customers impacted during that hour (i.e. a bigger goof-up), they'd have lost ~$2.55 million.

If you had 9,000 engineers making 100k a year on average, they could each spend an hour every day on paranoid safeguard processes and only cost the company $308,000.

So is it worth investing in processes to avoid that? Absolutely. Even if they can't find a good way to automate this, it's hard to argue against protecting against that sort of loss of revenue however you can.
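Rough math behind those figures, for the curious. This is just a back-of-the-napkin sketch; the Q1 2009 revenue of roughly $5.5 billion is my working assumption, and everything else follows from it:

    # Back-of-the-napkin sketch of the loss estimate above. The Q1 2009
    # revenue figure (~$5.5B) is my assumption; the rest follows from it.
    q1_2009_revenue = 5.5e9          # dollars, whole quarter
    hours_in_quarter = 90 * 24       # roughly 90 days

    revenue_per_hour = q1_2009_revenue / hours_in_quarter   # ~$2.55M/hour
    loss_14_pct = 0.14 * revenue_per_hour                   # ~$355k

    print(f"revenue per hour:       ${revenue_per_hour:,.0f}")
    print(f"14% down for one hour:  ${loss_14_pct:,.0f}")
    print(f"100% down for one hour: ${revenue_per_hour:,.0f}")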


It's incredibly myopic to say that an hour of downtime equals an hour of lost revenue. When my favorite takeout place has busy phone lines, I wait 5 minutes and then call back.

I wanted to search for something during the downtime, and I didn't go to Yahoo--I waited. They definitely lost revenue, but it is a ridiculous, baseless claim that everyone went somewhere else during the downtime. You have absolutely no data to make any such claim.

Furthermore, your calculated cost of an engineer's time is simplistic and inaccurate. It doesn't count the lost revenue from delaying the release of their work, or the reduced time value of that money (getting money earlier means more time to multiply it through investment and reinvestment).

And, even worse, you are comparing the DAILY costs of developer time (which you grossly underestimated) to something that happens, maybe, once or twice per decade.

It could cost them many millions--maybe even hundreds of millions or more--per year to implement such a policy.

That's why these mistakes happen--it is cheaper to fix the rare screw-up than to waste too much time checking everything, except in very few circumstances.


If you had 9,000 engineers making 100k a year on average, they could each spend an hour every day on paranoid safeguard processes and only cost the company $308,000.

Uhm, it's closer to $500k. Per day. So based on today's goof, which is presumably rare, with your batch of added paranoia they'd still be out $150k for the day, and they'd be out $500k on all the days when there wasn't a colossal screwup.
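Spelling the arithmetic out (the 2,000 working hours per year is my assumption, and my guess about where $308k came from is just that, a guess):

    # Sketch of the cost arithmetic: 9,000 engineers at $100k/year, one
    # hour per day each spent on safeguard process.
    engineers = 9000
    salary = 100_000

    # ~2,000 working hours/year (250 days * 8 hours) is my assumption.
    cost_per_day = engineers * (salary / 2000)       # $450,000, i.e. ~$500k

    # My guess at where $308k came from: dividing by calendar days instead
    # (365 days * 8 hours).
    naive_cost = engineers * (salary / (365 * 8))    # ~$308,000

    print(f"per working day:  ${cost_per_day:,.0f}")
    print(f"per calendar day: ${naive_cost:,.0f}")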


You're 100% right. I picked too outlandish a number in my devil's advocating and back-of-the-napkin'd the engineering cost wrong. Sorry about that--can't correct my comment anymore, unfortunately. You shoulda slammed me sooner. I can't see any rational reason to argue that way with those economics.


It turns out that you can reduce the number of surgical instruments left in patients by 33% by implementing checklists. Surgeons won't do it because it feels beneath them.


And it is. We have computers for this. Specifically, RFID chips and readers.

Surgeons are very very expensive, so their time is too.


Surgeons are very very expensive, so their time is too.

Surgical instruments, on the other hand, are easy to replace ;)


You don't need to get that fancy. Just paint silhouettes of the instruments onto the counter. Same effect, simpler operation, bonus clutter reduction.



So get the nurses to do it for them, which, I believe, is what the researchers who were looking at this had to do.


What bothers me is why this isn't more like 90% (in other words, why you don't check all items in and out; anything else is pretty lame supply chain management).


I'm sure they have plenty of safeguards in place. Think about all the times in the last 5 years that Google's service has been seriously interrupted. Oh wait, there aren't any. Their turnaround time on this bug was pretty fantastic as well. Discovered, analyzed, patched, and apologized for in under 24 hours. Nice.


You are right. It's unfair (and all too easy) for someone from the outside to point their finger and say, "Bad! They should have done X."

However, the reality is that Google has positioned themselves such that they have quickly become a utility. This goes beyond search: they are a communications network (email, chat) and an ad service, among other things. How many websites depend solely on Google AdSense for their revenue?

Just like electricity, telephone service, and other utilities, we have come to depend on them for living our daily lives. Loss of service at a large scale cannot be tolerated.

This may have been a relatively minor event, but the point remains that it still seems much too easy for such events to take place. I feel that Google should be one of the companies pushing technology in this area forward. Hopefully they will release information on how they plan to prevent this sort of event from occurring in the future.


This is at least the second time this year that they've had a serious/major issue for an hour, the other being the snafu where they marked the entire world as malware.

They respond awesomely, but they're not immune to major issues and I doubt it will get any easier for them.



Sure, but even if something is faulty I would guess that they have a way to test these updates on only a small portion of their network/traffic (at least much smaller than the 14% they said was affected).


Sounds like a lot of money could be made by allowing an engineer to "undo" a routing update.


The thing is that routes aren't (usually) hard-coded anywhere: they're computed by each individual device, and while requests to perform the computation can be triggered upstream, there's nowhere that has a "one true" view of the network.

You could theoretically say "at time T1, everyone save your current tables" and then sometime later (when you were confident it had propagated everywhere) say "at time T2, revert to your tables as of T1", but you'd have to assume that no devices had joined or left in the intervening time; in practice the only way back is to trigger a new computation and return to an approximation of the earlier state.
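To make that concrete, here's a toy sketch: a made-up four-node topology where each node computes its own BFS next hops. It's purely illustrative, nothing to do with how Google's network actually computes routes, but it shows how a saved table quietly goes stale once the topology changes under it:

    from collections import deque

    # Toy model: each node computes its own next hops toward every other
    # node from whatever topology it currently sees (BFS shortest path).
    # Entirely hypothetical; not how any real router works.
    def next_hops(topology, source):
        hops, seen, queue = {}, {source}, deque([(source, None)])
        while queue:
            node, first = queue.popleft()
            for neighbor in topology.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    hops[neighbor] = first or neighbor
                    queue.append((neighbor, hops[neighbor]))
        return hops

    topology = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

    saved = next_hops(topology, "A")    # "save your current tables" at T1

    # Before T2 arrives, C leaves the network.
    del topology["C"]
    topology["B"].remove("C")
    topology["D"].remove("C")

    print(saved["D"])                   # 'B': the saved entry still points
                                        # down a path that ran through C and
                                        # no longer exists.
    print(next_hops(topology, "A"))     # recomputation drops D entirely: you
                                        # only get an approximation of the
                                        # earlier state, not an undo.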


Yes, but this is really like trying to "undo" something that you have spoken.

When you make a routing update, it takes place immediately (generally). You can only undo it by making another update (presuming of course that your first update didn't impair your ability to get more data through...).

There are tools, like WANDL, that allow you to model your network and see what various changes or outages will look like in "real life". Of course, these simulations take time to set up and run. And, I've made simple routing updates before. I KNOW that this will work and I don't need to model it...


Likely hypothesis -- three thousand third-party application developers who use Google web services just got phone calls regarding application speed issues.


Ah, but that's also the nice thing about relying on somebody else's API: When it breaks, it's not your job to fix it.

Back in the dot-com days of '99, I built and maintained an app that was pretty much identical to what Google Maps released 5 years later (yeah, with tiles and javascript dragging & all that). The downside was that the backend was pulling from this crap GIS system that was always overloaded.

So it would bog down under load. And it was my problem. And it hurt.

So 10 years later, I run a little site built on the Google Maps API. Today it bogged down under load. And it was sunny out so I went out rock climbing because a whole team of smart people were scrambling to fix it for me and it wasn't my problem.

Sometimes "not my problem" can be pretty nice.


Not only that, I'm almost certain the Google engineers would do a better job than I would! Faster, too...


You climb? Where at?

I always wanted to pitch the idea of having a YC meetup at a climbing gym.


Kinda everywhere. I travel most of the year, following the sunshine to spots with good climbing, surfing and wifi. I'm in the Lake District (Northwest England) at the moment.

I seem to find myself in the 'states for about 3-6 months most years, usually in LA. If there were a hacker crag session when I was in town, I'd be there.


I'm a climber who loves surfing and hacking. Looking for hires? haha.

A climbing meetup is something I've been planning. I will probably ping the local (nyc) folks to see if they're interested.



