This is your pilot speaking. Now, about that holding pattern... (googleblog.blogspot.com)
74 points by epi0Bauqu on May 14, 2009 | 33 comments



Honest and classy apology. Nice, Google.


I liked that they admitted that it was an embarrassing mistake. The use of that word resonates with anyone who has ever done anything stupid only to realize it moments after.

For me it's always that same stupid routine operation that I've done 1000x before, so I turn my brain off while I do it. Then there's that moment of shocking clarity where you think "I didn't just... did I?" followed by the mad scramble to check and then the sinking realization that "yes, yes I did." shudder


Yeah. Honesty is nice. Owning up to failures is extra nice.

The Twitter fiasco was interesting to watch: first they removed a feature because it was "confusing" [according to them], then they backtracked and said it was removed because of engineering problems.


Honesty is not nice and neither is owning up to failures. It's expected. It's a sad state of affairs when so many people and companies hold themselves to a lower standard that the ones who do what they're supposed to do somehow earn extra recognition for it.


Sounds like someone fucked up a BGP update. Happens all the time: http://www.google.com/search?q=bgp+update+outage+level3


Is it just me, or does it seem that when you are something like Google, there should be a process to prevent just this sort of thing?


I hate it when someone suggests adding process to fix a problem, and I've seen this quite a few times already.

Most people clearly understand the benefits of adding process, but very few seem to realize the costs.

If I tried hard, I'm pretty sure I could create a checklist with 1000 items for each developer to go through, and no one could argue against any of the items; individually, they would all be reasonable, necessary, and correct. However, if I forced every developer to go through the list every time, for every change, they would rightly feel crushed.

With very few exceptions where a new process is really warranted, I see people trying to substitute process for either thinking or automation. That is a recipe for bureaucracy, and in my view a good part of why working for a BigCo can be so miserable sometimes.

A new process should be a last resort, adopted only after we have answered yes to: a) Is it really beyond us to automate this? b) Is there some flaw in human beings that ensures this mistake will repeat? c) Are the consequences of this mistake really serious?


When I said process, I meant automated process. As a programmer, I consider performing calculable tasks by hand to be out of the question :-P


I agree that process can have a demoralizing effect, but I like to play devil's advocate.

An hour of Google's revenue lost from 14% of their customers costs them about $350,000 (judging roughly by Q1 2009 revenue numbers). Had it been 100% of customers impacted during that hour (i.e. a bigger goof-up), they'd have lost ~$2.55 million.

If you had 9,000 engineers making 100k a year on average, they could each spend an hour every day on paranoid safeguard processes and only cost the company $308,000.

So is it worth investing in processes to avoid that? Absolutely. Even if they can't find a good way to automate this, it's hard to argue against protecting against that sort of loss of revenue however you can.
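Rough math behind those figures, for the curious. This is just a back-of-the-napkin sketch; the Q1 2009 revenue of roughly $5.5 billion is my working assumption, and everything else follows from it:

    # Back-of-the-napkin sketch of the loss estimate above. The Q1 2009
    # revenue figure (~$5.5B) is my assumption; the rest follows from it.
    q1_2009_revenue = 5.5e9          # dollars, whole quarter
    hours_in_quarter = 90 * 24       # roughly 90 days

    revenue_per_hour = q1_2009_revenue / hours_in_quarter   # ~$2.55M/hour
    loss_14_pct = 0.14 * revenue_per_hour                   # ~$355k

    print(f"revenue per hour:       ${revenue_per_hour:,.0f}")
    print(f"14% down for one hour:  ${loss_14_pct:,.0f}")
    print(f"100% down for one hour: ${revenue_per_hour:,.0f}")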


It's incredibly myopic to say that an hour of downtime equals an hour of lost revenue. When my favorite takeout place has busy phone lines, I wait 5 minutes and then call back.

I wanted to search for something during the downtime, and I didn't go to Yahoo--I waited. They definitely lost revenue, but it is a ridiculous, baseless claim that everyone went somewhere else during the downtime. You have absolutely no data to make any such claim.

Furthermore, your calculated cost of an engineer's time is simplistic and inaccurate. It doesn't count the lost revenue from delaying the release of their work, or the reduced time value of that money (getting money earlier means more time to multiply it through investment and reinvestment).

And, even worse, you are comparing the DAILY costs of developer time (which you grossly underestimated) to something that happens, maybe, once or twice per decade.

It could cost them many millions--maybe even hundreds of millions or more--per year to implement such a policy.

That's why these mistakes happen--it is cheaper to fix the rare screw-up than to waste too much time checking everything, except in very few circumstances.


If you had 9,000 engineers making 100k a year on average, they could each spend an hour every day on paranoid safeguard processes and only cost the company $308,000.

Uhm, it's closer to $500k. Per day. So based on today's goof, which is presumably rare, with your batch of added paranoia they'd still be out $150k for the day, and they'd be out $500k on all the days when there wasn't a colossal screwup.
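Spelling the arithmetic out (the 2,000 working hours per year is my assumption, and my guess about where $308k came from is just that, a guess):

    # Sketch of the cost arithmetic: 9,000 engineers at $100k/year, one
    # hour per day each spent on safeguard process.
    engineers = 9000
    salary = 100_000

    # ~2,000 working hours/year (250 days * 8 hours) is my assumption.
    cost_per_day = engineers * (salary / 2000)       # $450,000, i.e. ~$500k

    # My guess at where $308k came from: dividing by calendar days instead
    # (365 days * 8 hours).
    naive_cost = engineers * (salary / (365 * 8))    # ~$308,000

    print(f"per working day:  ${cost_per_day:,.0f}")
    print(f"per calendar day: ${naive_cost:,.0f}")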


You're 100% right. I picked too outlandish a number in my devil's advocating and back-of-the-napkin'd the engineering cost wrong. Sorry about that--can't correct my comment anymore, unfortunately. You shoulda slammed me sooner. I can't see any rational reason to argue that way with those economics.


It turns out that you can reduce the number of surgical instruments left in patients by 33% by implementing checklists. Surgeons won't do it because it feels beneath them.


And it is. We have computers for this. Specifically, RFID chips and readers.

Surgeons are very very expensive, so their time is too.


Surgeons are very very expensive, so their time is too.

Surgical instruments, on the other hand, are easy to replace ;)


You don't need to get that fancy. Just paint silhouettes of the instruments onto the counter. Same effect, simpler operation, bonus clutter reduction.



So get the nurses to do it for them, which, I believe, is what the researchers who were looking at this had to do.


What bothers me is why this isn't more like 90% (in other words, why you don't check all items in and out; anything else is pretty lame supply chain management).


I'm sure they have plenty of safeguards in place. Think about all the times in the last 5 years that Google's service has been seriously interrupted. Oh wait, there aren't any. Their turnaround time on this bug was pretty fantastic as well. Discovered, analyzed, patched, and apologized for in under 24 hours. Nice.


You are right. It's unfair (and all too easy) for someone from the outside to point their finger and say, "Bad! They should have done X."

However, the reality is that Google has positioned themselves such that they have quickly become a utility. This goes beyond search: they are a communications network (email, chat) and an ad service, among other things. How many websites depend solely on Google AdSense for their revenue?

Just like electricity, telephone service, and other utilities, we have come to depend on them for living our daily lives. Loss of service at a large scale cannot be tolerated.

This may have been a relatively minor event, but the point remains that it still seems much too easy for such events to take place. I feel that Google should be one of the companies pushing technology in this area forward. Hopefully they will release information on how they plan to prevent this sort of event from occurring in the future.


This is at least the second time this year that they've had a serious/major issue for an hour, the other being the snafu where they marked the entire world as malware.

They respond awesomely, but they're not immune to major issues and I doubt it will get any easier for them.



Sure, but even if something is faulty I would guess that they have a way to test these updates on only a small portion of their network/traffic (at least much smaller than the 14% they said was affected).


Sounds like a lot of money could be made by allowing an engineer to "undo" a routing update.


The thing is that routes aren't (usually) hard-coded anywhere: they're computed by each individual device, and while requests to perform the computation can be triggered upstream, there's nowhere that has a "one true" view of the network.

You could theoretically say "at time T1, everyone save your current tables" and then sometime later (when you were confident it had propagated everywhere) say "at time T2, revert to your tables as of T1", but you'd have to assume that no devices had joined or left in the intervening time; in practice the only way back is to trigger a new computation and return to an approximation of the earlier state.
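To make that concrete, here's a toy sketch: a made-up four-node topology where each node computes its own BFS next hops. It's purely illustrative, nothing to do with how Google's network actually computes routes, but it shows how a saved table quietly goes stale once the topology changes under it:

    from collections import deque

    # Toy model: each node computes its own next hops toward every other
    # node from whatever topology it currently sees (BFS shortest path).
    # Entirely hypothetical; not how any real router works.
    def next_hops(topology, source):
        hops, seen, queue = {}, {source}, deque([(source, None)])
        while queue:
            node, first = queue.popleft()
            for neighbor in topology.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    hops[neighbor] = first or neighbor
                    queue.append((neighbor, hops[neighbor]))
        return hops

    topology = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

    saved = next_hops(topology, "A")    # "save your current tables" at T1

    # Before T2 arrives, C leaves the network.
    del topology["C"]
    topology["B"].remove("C")
    topology["D"].remove("C")

    print(saved["D"])                   # 'B': the saved entry still points
                                        # down a path that ran through C and
                                        # no longer exists.
    print(next_hops(topology, "A"))     # recomputation drops D entirely: you
                                        # only get an approximation of the
                                        # earlier state, not an undo.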


Yes, but this is really like trying to "undo" something that you have spoken.

When you make a routing update, it takes place immediately (generally). You can only undo it by making another update (presuming of course that your first update didn't impair your ability to get more data through...).

There are tools, like WANDL, that allow you to model your network and see what various changes or outages will look like in "real life". Of course, these simulations take time to set up and run. And, I've made simple routing updates before. I KNOW that this will work and I don't need to model it...


Likely hypothesis -- three thousand third-party application developers who use Google web services just got phone calls regarding application speed issues.


Ah, but that's also the nice thing about relying on somebody else's API: When it breaks, it's not your job to fix it.

Back in the dot-com days of '99, I built and maintained an app that was pretty much identical to what Google Maps released 5 years later (yeah, with tiles and javascript dragging & all that). The downside was that the backend was pulling from this crap GIS system that was always overloaded.

So it would bog down under load. And it was my problem. And it hurt.

So 10 years later, I run a little site built on the Google Maps API. Today it bogged down under load. And it was sunny out so I went out rock climbing because a whole team of smart people were scrambling to fix it for me and it wasn't my problem.

Sometimes "not my problem" can be pretty nice.


Not only that, I'm almost certain the Google engineers would do a better job than I would! Faster, too...


You climb? Where at?

I always wanted to pitch the idea of having a YC meetup at a climbing gym.


Kinda everywhere. I travel most of the year, following the sunshine to spots with good climbing, surfing and wifi. I'm in the Lake District (Northwest England) at the moment.

I seem to find myself in the 'states for about 3-6 months most years, usually in LA. If there were a hacker crag session when I was in town, I'd be there.


I'm a climber who loves surfing and hacking. Looking for hires? haha.

A climbing meetup is something I've been planning. I will probably ping the local (nyc) folks to see if they're interested.



