Hacker News
How a load-balancing bug led to worldwide Chrome crashes (code.google.com)
176 points by jpdus on Dec 11, 2012 | 83 comments



The last discussion was mostly about how the title was linkbait. I want to hear people's opinions on whether it's appropriate for a browser (Chrome) to be designed such that it doesn't operate independently: that it can be crashed (or self destruct bug, insert your own word here) by a remote server at any time.

To my knowledge, Firefox doesn't do that. Safari doesn't do that. Internet browsers are probably the #1 most important app on a computer these days, so browser reliability is vital.


Chrome Sync is, AFAIK, not a push service. Something polled a Google server, it returned a bad answer, it crashed the browser. Why is this important?

Because it's entirely possible that Firefox or Safari, for example, could have been crashed by contacting the safebrowsing server, and the safebrowsing server returning an answer that crashes it.

Firefox also does remote update checks, plugin update checks, etc.

None of the browsers you mention are "independent" of internet servers anymore. They are meant to function independently, as is Chrome, but exactly the right remote bug could likely crash all of them.


Why is it possible to receive a response that crashes the browser? I'll allow (with reservations) your premise that browsers aren't so independent anymore. But it seems to me that input from an external source (even trusted) ought to be validated, and malformed input should raise a warning or something. The fact that crashing is a possible behavior upon receiving unexpected input is surprising to me.


Well that's the point of a bug, isn't it? Chrome didn't handle malformed input appropriately and crashed because of it.

It wasn't a command that shut down Chrome, it was just poor handling of an edge-case which resulted in unexpected behavior (crash).

It's a sad truth that most programs will explode if you fling garbage at them. When push comes to shove, many development timelines don't have room to bulletproof against everything.


Why is it possible to gain admin access to an operating system by sending requests that crash the web server? Bugs happen.


I have had safebrowsing and auto update disabled since FF 3.6.


The crash didn't happen if you signed out of sync.


Exactly. I don't use Chrome Sync and never observed this crash.


It's a crash bug. Bugs happen.

It is not a design flaw. Sure, this specific vulnerability would not be there if the remote sync feature wasn't there, but people like features.

Chrome has a pretty good security track record. I'm not worried.


It's a design flaw. Should be designed like this:

    try {
     // Do syncing things.
    } catch(...) {
     // Continue browsing.
    }


1) You can't do that in C/C++. At best you can catch only a subset of failures that way.

2) Any proposed design choice must be implemented with code, and that code can have bugs and crash. That is what happened here.


> You can't do that in C/C++

Well, first, the sync code might not need to be in C/C++. It could be in a sandboxed, safe language instead. Other browsers do that.

Or, at minimum, it could be in C/C++ but at least in a sandboxed side process; Chrome has the capability for that.
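
A minimal sketch of the side-process idea, assuming a POSIX system (fork and waitpid). This is not how Chrome's sandbox is implemented; run_isolated, healthy_sync, and crashing_sync are hypothetical names used only for illustration. The child runs the risky task, and the parent only looks at how the child ended:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <cstdlib>

// Runs `task` in a forked child process.
// Returns true only if the child exited normally with status 0.
// A crash in the child (for example SIGABRT or SIGSEGV) yields false,
// and the calling process keeps running either way.
bool run_isolated(void (*task)()) {
    assert(task != nullptr);
    pid_t pid = fork();
    if (pid < 0) {
        return false;  // fork itself failed
    }
    if (pid == 0) {
        task();    // child: do the risky work
        _exit(0);  // child: report success without running parent cleanup
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return false;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

// Hypothetical stand-ins for the sync code.
void healthy_sync() {}
void crashing_sync() { std::abort(); }
```

With this, run_isolated(crashing_sync) returns false and the parent carries on. The only thing that travels back to the parent here is the exit status; anything richer needs its own channel, and that channel has to be validated too.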


At some point, the results of what the side process did have to get communicated to the parent process. You can reduce the size of the channel, but you can't close it completely. Bugs happen. Bugs in never-executed code happen and tend to stick around longer.


> You can reduce the size of the channel, but you can't close it completely.

Of course, no one would argue otherwise. Every time you decide how much to reduce it, you define a tradeoff in terms of work vs. benefit.


> 2) It is a design choice on the Chrome team to fail fast and hard. It's better to get crash reports to our automated crash servers with diagnostics and stack traces than to have reports in the field of weird behavior with no way to debug.

It's a design choice that your product... which some people (myself included) pay money for, crash hard so that you can get better diagnostics? Sounds like misplaced priorities.


I disagree.

This sort of catchall and keep going error handling could leave your browser in a completely unknown state. It could start making bad requests, making the wrong requests, start leaking info, or more likely crash elsewhere but with a much less clean crash log.

When you don't know how to handle an error such that it bubbles all the way up to the top, often the best thing to do is crash. At least then you might get the logs that allow you to fix it and turn around the fix quickly and with confidence.

Crashes suck, but crashing and not knowing why sucks more.


This. I can't believe how many conversations I have had in my (short) career trying to convince people that catch(...) { /* ignore */ } is not an error-handling strategy. It is an error-ignoring strategy, and it opens you up to hilarious ROFLExploits, heisenbugs, and all manner of fun things. As a user, an app crash sucks (I know that well; every app I use has crashed before), but in order to fix it the developers generally need either a repro (almost no one provides these, at least not reasonable ones; this case was an exception, as the repro was trivial) or a crash dump (less helpful, as not all the info you need is always there). If you have a catch-all you have neither of these. At best you have vague reports that sometimes, when users do X, Y and Z and have been using your product for 18 hours, 'weird shit' starts happening. This is NOT the kind of bug you want to investigate if you value your sanity.


Sorry, I removed the point because I realized it wasn't relevant to this particular bug. This bug isn't a case of an assertion failing (which is the "crash hard" bit). It was just a logical failure. A bug.

(Incidentally, eliminating these kinds of rarely executed branches is a bugaboo of mine. They frequently have problems.)


> It's a design choice that your product... which some people (myself included) pay money for

How do you pay for Chrome?


Purchase a Chromebook.


You pay for Chrome?


You should go interview with Google's Chrome team and bring this up. I'm sure it's something they hadn't thought of.


Among other things, that try..catch won't do anything for SIGSEGV (int *p = 0; *p = 1) or SIGFPE (1/0). You can handle the signals, or you could miss them like you did just now. That won't be a design issue but an implementation bug.
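
To illustrate the point, here is a small sketch using only the standard <csignal> header. It delivers SIGFPE deliberately with std::raise instead of actually dividing by zero, since returning from a handler after a real hardware fault is undefined behavior. The names on_signal and signal_bypasses_catch are made up for this example:

```cpp
#include <cassert>
#include <csignal>

// Set by the signal handler; sig_atomic_t is safe to write from a handler.
static volatile std::sig_atomic_t g_signal_seen = 0;

extern "C" void on_signal(int) { g_signal_seen = 1; }

// Raises SIGFPE inside a try block with a catch-all.
// Returns true if the signal handler ran and the catch block did not.
bool signal_bypasses_catch() {
    g_signal_seen = 0;
    bool caught_by_catch = false;

    auto previous = std::signal(SIGFPE, on_signal);
    assert(previous != SIG_ERR);

    try {
        std::raise(SIGFPE);  // a signal, not a C++ exception
    } catch (...) {
        caught_by_catch = true;  // never reached for a signal
    }

    std::signal(SIGFPE, previous);  // restore the earlier disposition
    return g_signal_seen == 1 && !caught_by_catch;
}
```

The catch-all never sees the signal. Only the handler installed with std::signal does, which is the separate mechanism you would have to remember to set up.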


I think there's a disagreement about what is design and what is implementation.


An implementation bug of this scope would require several bugs; in the code itself and in the (hopefully many) test cases.


If the code in the try block caused a segfault, would that have helped?


On Windows you can catch segfaults with __try and __except.

Trust me when I say you don't want to be writing code in a stack frame above someone else who catches and ignores exceptions using them.

There are actually some cases where Windows will catch and discard segfaults if they happen in response to certain window messages. That bug was hard to track down, let me tell you.


In what way was it 'designed' to not operate independently? The browser is, in theory, perfectly fine operating when sync is broken (and in fact wouldn't have triggered this were sync fully unreachable). This was simply a bug in sanitizing input, nothing further. It is no different in flavour from input sanitization problems within the JavaScript engine, which have been known to occur as well.


To your knowledge there are no crash bugs in the Firefox sync code? How much are you willing to bet you're right? I seriously doubt an entire module like that has absolutely nothing wrong with it.

It's pretty ridiculous of you to point at a single mistake in implementation and blame the entire sync feature.


Firefox Sync is written in JS, not C++, so segfaults are unlikely.


I hope you are kidding. JITs often have a large number of crashing bugs, usually even more than static compilers (because it is often hard to reproduce every set of circumstances that causes something to happen, unlike with static compilers).


The point is that the probability that JS code will segfault the browser is dramatically less than the probability that C++ code will segfault the browser.


I don't believe this for a second. I might believe it if you said "the probability that commonly used JS code will segfault the browser is dramatically less than the probability that browser-specific C++ code will segfault the browser". Which would be a very different claim.

Remember that most JS is popular JS, with some small amount of custom lines. JS seems less crashy because people don't use as much "random" JS in general.



But the same JIT is used by all programs. Less code, executed more.


The network stack is written in C++. It's entirely possible for the sync service to return an HTTP response that tickles a bug there.


It was a Chrome Sync bug. If you had Chrome Sync enabled, it crashed. Believe it or not, Chrome Sync is actually a very useful feature for those of us who go back and forth between computers on a daily basis.


I am similarly troubled.

We back up so much of our tools and data, but without a working browser, we're sunk. Especially non-technical people.

That a huge percentage of the internet clients in the world can be simultaneously removed from accessing the internet, either intentionally or accidentally, is troubling me this morning.


That's why it is useful to have different browsers, some of which aren't as connected as Chrome is.


> that it can be crashed (or self destruct bug, insert your own word here) by a remote server at any time.

It isn't by design that syncing can affect the whole browser; it's a bug in the syncing code which should have been handled. There is no self destruct bug, and calling it that is incorrect. Are you aware of the fix?

http://src.chromium.org/viewvc/chrome/trunk/src/sync/engine/...

The fix just checks whether the model is valid before making the call that was throwing the out-of-bounds exception.
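
Roughly what a guard of that kind looks like, as a hedged sketch and not the actual Chromium patch. The constants, TypeSet, is_real_type, and mark_throttled are all hypothetical; the only idea taken from the fix is to validate the value before using it to index a bitset:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical numbering: values below kFirstRealType (including 0 for
// "unspecified") are not real types and have no slot in the bitset.
constexpr int kFirstRealType = 2;
constexpr int kTypeCount = 8;
using TypeSet = std::bitset<kTypeCount - kFirstRealType>;

bool is_real_type(int type) {
    return type >= kFirstRealType && type < kTypeCount;
}

// Marks `type` as throttled. Returns false and leaves `set` untouched
// when the type is unknown, instead of computing a bad index.
bool mark_throttled(TypeSet& set, int type) {
    if (!is_real_type(type)) {
        return false;
    }
    const std::size_t index = static_cast<std::size_t>(type - kFirstRealType);
    assert(index < set.size());
    set.set(index);  // std::bitset::set throws std::out_of_range on a bad index
    return true;
}
```

Without the is_real_type check, a type of 0 would turn into a negative offset, and std::bitset::set would throw std::out_of_range.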


I think cross-device syncing is valuable functionality that I enjoy. Obviously this comes with a risk that if bad data is sent that is unhandle-able by my browser, it may cause a crash. The fact that syncing is elective and toggle-able is a great feature to be included.

I think it's a bit sensationalist to still refer to it as "being crashed" at any time, rather than saying it "may crash due to a bug".


You could just disconnect from the internet, and the browser would stop crashing.


Helpful!


FWIW the same sort of bug could have occurred in the "auto-update" features of any browser. The problem was the client's failure to handle an unexpected response from the server.


I think your outrage is misplaced here. Certainly a browser should not crash if a given server happens to be offline, but we can easily extend that to say that a browser should not crash, period, can we not? But that is a rather arbitrarily high bar. Browsers are in the business of connecting to sites over the internet and if it is possible for such communication to crash the browser (which is likely to be true for almost any browser) then that's a fairly equivalent problem.


Hmm, a load balancing bug which eats several production services for lunch and has a bunch of second-order effects. I wonder if it involves running a script and pushing the output without looking at a "diff" view to see what changed, and they managed to push a config which sent all of the world's traffic to one location.

It seems like just the other day when I was thinking about this very thing. http://rachelbythebay.com/w/2012/11/19/lb/


[deleted]


bns_aggregator.py, not even once?


Previous discussion (perplexingly marked dead): http://news.ycombinator.com/item?id=4904125


Worryingly marked [dead], in fact. How did it end up dead?

It might not be the greatest article, but it highlights a very real and new point of failure that cloud apps are introducing. I was hoping for a good discussion to check out later.


Probably because of the horrible link-bait title and general inaccuracy of the article. Especially when this link was also on the front-page (a clear, concise description of what happened without the added FUD).


Never ascribe to malice that which is adequately explained by incompetence. :) HN's code is weird in places, e.g. it's possible to accidentally kill your own posts by double-clicking submit (I can't remember why it does this, but there was a sensible-ish explanation).


Yep, the same thing happens with comments (found out thanks to a mouse malfunction).


People probably flagged it over the crappy title and the fact that many of the comments were meta discussion.


I'm surprised at the level of hyperbole here on this thread.


on the server side, it looks like the problem could have been avoided with better types - it seems that there was a confusion between status values that can include 0 and those that cannot (alternatively, perhaps better, there was no status for the case where the status was undefined?) and then a hand-written assertion that a particular case could not happen (and so was not tested for).

the bug report describes all that, roughly (if i've understood) but doesn't seem to be worried about the higher level issues - the inconsistent types and need for fragile human assertions about type logic.

(not java bashing - don't see why this couldn't be solved in java)

but i guess this is just a bug report. for an outage like this i suppose there's going to be a major review? is that all internal? would be interesting to watch.


> (not java bashing - don't see why this couldn't be solved in java)

Chromium is C++. The code diffs in the bug report are C++. Where is Java coming into the picture?


oh, sorry - didn't look at code, assumed java from some comment. but either way, same point - when you talk about solving problems with types, people tend to assume you're thinking of haskell or similar, but just because you're not using some super-cool functional language doesn't mean you shouldn't demand all you can get from your type system.

(ps i was talking about the code on the server side; not chromium)


> (ps i was talking about the code on the server side; not chromium)

Oh I see.

> on the server side, it looks like the problem could have been avoided with better types - it seems that there was a confusion between status values that can include 0 and those that cannot (alternatively, perhaps better, there was no status for the case where the status was undefined?)

From what I understood, they are talking about protocol buffer types. The sync server sent a message telling Chromium to throttle all types, say A, B, and C. Chromium didn't know about type C, so some code returned 0 for the unspecified type, another piece of code calculated an index based on what the previous step returned, and then, because of the 0, a negative index was accessed in the bitset, leading to an out-of-bounds exception.

The issue is that the server sent all types to clients rather than sending only types known to the client, and the client didn't gracefully handle unknown types. I don't think a better type system would have helped the server - that looks like a logic bug, not a typing bug.


what i was trying to say is that you can (often) push this kind of logic bug into the type system - by having, if you like, different types of statuses, or a special type for unknown values (a "maybe status").

if you can do that then you can get the "infallible" compiler to provide the "this branch will not be executed" logic. but there may be efficiency trade-offs, or it may be so directly tied to low level protocols that it is impossible.

five or ten years ago i would not have thought of the problem in this way - i would have agreed with you (and the bug report) that it is just logic. but slowly i am starting to learn to rely more on types. but i don't know enough here to suggest details...
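
A rough sketch of that idea in C++17, using std::optional as the "maybe status". The DataType enum, the wire numbers, kTypeCount, and both function names are invented for illustration and have nothing to do with the real sync protocol:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical set of types this client knows about.
enum class DataType { Bookmarks = 0, Passwords = 1, Preferences = 2 };
constexpr std::size_t kTypeCount = 3;

// Converts an untrusted number from the wire. Unknown values become
// std::nullopt instead of a magic 0 that later code might misuse.
std::optional<DataType> data_type_from_wire(int wire_value) {
    switch (wire_value) {
        case 1: return DataType::Bookmarks;
        case 2: return DataType::Passwords;
        case 3: return DataType::Preferences;
        default: return std::nullopt;
    }
}

// Takes a DataType rather than an int, so (short of an explicit cast)
// callers must deal with the empty optional before computing an index.
std::size_t bitset_index(DataType type) {
    const std::size_t index = static_cast<std::size_t>(type);
    assert(index < kTypeCount);
    return index;
}
```

The caller is then pushed into writing something like if (auto t = data_type_from_wire(n)) { ... bitset_index(*t) ... }, so the "unknown type" branch is visible in the code instead of living in a hand-written assumption.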


Quick question for those that experienced it: did it take down the whole Chrome process, or just a single tab? I'd be very displeased if that happened again and I lost work as a result.


It took down the whole Chrome process.


That's grim and completely unacceptable!


Yeah.... it was a bug. Multiple bugs working in concert, actually.


Switch browsers

or

Contribute to Chrome development.


you can't contribute to Chrome development unless you work at Google, fwiw.


Bugs are a part of software life. Considering the great track record of Google Chrome, I'm not concerned and the issue was resolved pretty quickly. Even Google developers make mistakes as do the rest of us.


I'm pretty uncomfortable with my browser having "hardcoded" code to connect me with one walled cloud, be it Google, Microsoft or Apple.

Where is the choice? It looks like these days using any software or hardware from the tech giants means being their pet.


Forgive me if I've misunderstood, but Chrome Sync is off by default, and there are alternatives, such as XMarks, that you can install instead. Same goes for iCloud Sync in Safari and Firefox Sync too.


Yep, I signed up for lastpass+xmarks for these reasons. I pay them the $20 for mobile access as well, that and I want to support a cross browser solution instead of having everything tied to one vendor.


Sync is strictly opt-in. I know because I've never opted-in.

(It was because I'm worried about just how much information Google is collecting on me, so while I can feel smart for not having a crashing Chrome yesterday, really I was just lucky. There but for the grace of God.)


It's not hardcoded. It's a feature that's opt-in. Firefox has a similar feature and most people find it quite useful. This is a bug in software that caused the browser to crash. It just so happens that the outage caused a certain scenario to arise that caused this particular bug to surface. It's in no way a design flaw. It's just poor implementation.


The choice is that you have to sign in and enable Chrome Sync, it is not on by default: https://support.google.com/chrome/bin/answer.py?hl=en&an...


You should use Firefox ... it has the same functionality, and you can host the server if you want.


Make your own browser.


Yep, I'm doing it, over Chrome's dead body (its source code), but it will not be a browser by definition. It's another "thing".

But it will certainly respect your data, privacy and civil rights. This Big Brother thing (behind our backs) must stop.


Perhaps you should use WebKit or Chromium as your browser then.


Or perhaps contribute to Chrome development. There seem to be a lot of experts here, I'm sure they'd all be welcomed with open arms.


Chromium has the same code.


Chromium has Chrome Sync??


I don't understand why the crash happened when visiting Gmail if the bug was in Chrome Sync.


Currently getting a "500. That’s an error." page. Seems like even the bug has crashed too...!


Had significant issues with the Chrome Web Store yesterday - images not loading from some content servers, extensions failing to install ('The extension file was not a CRX'). I guess this was the cause.


"That quota service experienced traffic problems today due to a faulty load balancing configuration change. That change was to a core piece of infrastructure that many services at Google depend on. This means other services may have been affected at the same time, leading to the confounding original title of this bug."

Why the downvote?



