
Although not just anyone can crawl 2+ billion pages AND fix all the crazy edge cases without going insane :)



Who says I didn't go insane? ;-)

The crazy edge cases were...challenging. The source code to the parser is very assert-heavy, so if there's anything that's amiss, it tends to blow up with an assertion failure. I'd run the MapReduce and it would blow up a few hundred times, then MapReduce would stop trying and kill the job. Then when I had a spare moment, I'd look at the assertion failures, pick off the most common ones, and run it again. This time it would get farther, I'd pick off another couple of bugs, and run it again.

As expected, the triggering frequency of bugs follows a power-law distribution. It took a long time before I could get it to parse one HTML document, and then it would fail on 1% of documents, then 0.1% of documents, then 0.01%, and so on. It got stuck at a roughly 1-in-a-million failure rate for a long time, until I figured out that it was crashing because of a stack overflow in the testing code, which would recursively sanity-check the produced DOM. Some documents generate a DOM >20,000 nodes deep, which is evidently too much to fit in typical C stacks, although Gumbo can handle them. (I found one page with a DOM tree 100,000 nodes deep - it was really an XML document masquerading as HTML, with a bunch of self-closing nodes that don't self-close under HTML5 parsing rules - and when I posted the link to say "Look what I found!", I got a bunch of "Kind of a dick move, linking to a page that crashes WebKit.")
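To make the stack-overflow point concrete: a recursive checker spends one C stack frame per level of tree depth, so a 100,000-deep DOM can exhaust the stack even when the parser itself is fine. One common workaround is to keep the pending nodes in a heap-allocated work stack instead. The following is only a minimal sketch under assumptions: `Node`, `check_tree`, `make_chain`, and `free_chain` are hypothetical names for illustration, not Gumbo's actual types or test code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical minimal tree node; NOT Gumbo's real node type. */
typedef struct Node {
    struct Node *parent;
    struct Node **children;
    size_t num_children;
} Node;

/* One pending unit of work: a node plus its depth (root = 1). */
typedef struct {
    const Node *node;
    size_t depth;
} Frame;

/* Walks the tree without recursion, checking that every child's parent
 * pointer refers back to the node that lists it. Returns the maximum
 * depth, or 0 for a NULL root, a broken link, or allocation failure. */
size_t check_tree(const Node *root) {
    if (root == NULL) return 0;
    size_t cap = 64, len = 0, max_depth = 0;
    Frame *stack = malloc(cap * sizeof *stack);
    if (stack == NULL) return 0;
    stack[len++] = (Frame){root, 1};
    while (len > 0) {
        Frame f = stack[--len];
        if (f.depth > max_depth) max_depth = f.depth;
        for (size_t i = 0; i < f.node->num_children; i++) {
            const Node *child = f.node->children[i];
            if (child == NULL || child->parent != f.node) {
                free(stack);
                return 0;
            }
            if (len == cap) {
                /* Grow the work stack on the heap, not the C stack. */
                Frame *grown = realloc(stack, 2 * cap * sizeof *stack);
                if (grown == NULL) {
                    free(stack);
                    return 0;
                }
                stack = grown;
                cap *= 2;
            }
            stack[len++] = (Frame){child, f.depth + 1};
        }
    }
    free(stack);
    return max_depth;
}

/* Builds a degenerate tree in which every node has exactly one child,
 * roughly the shape of the pathological pages described above. */
Node *make_chain(size_t depth) {
    Node *root = NULL, *prev = NULL;
    for (size_t i = 0; i < depth; i++) {
        Node *n = calloc(1, sizeof *n);
        if (n == NULL) abort();
        n->parent = prev;
        if (prev != NULL) {
            prev->children = malloc(sizeof *prev->children);
            if (prev->children == NULL) abort();
            prev->children[0] = n;
            prev->num_children = 1;
        } else {
            root = n;
        }
        prev = n;
    }
    return root;
}

/* Frees a chain built by make_chain, again without recursion. */
void free_chain(Node *root) {
    while (root != NULL) {
        Node *next = root->num_children ? root->children[0] : NULL;
        free(root->children);
        free(root);
        root = next;
    }
}
```

Calling `check_tree(make_chain(100000))` should report a depth of 100000 using only a constant amount of C stack, whereas a naive recursive version of the same check would need 100,000 nested frames.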


I like programming assert-heavy C. As soon as something is out of whack with my mental model, the whole thing explodes. I sure as hell don't want to try handling things I don't already understand.
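For anyone who hasn't seen the style, here is a minimal sketch of what assert-heavy C can look like: every function re-checks the invariants it relies on, so a violated assumption aborts at the point of violation instead of corrupting state quietly. The `Buffer` type and its functions are hypothetical, invented purely for illustration. Note that `assert` compiles away under `NDEBUG`, so the sketch handles allocation failure with `abort()` rather than an assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer, for illustration only. */
typedef struct {
    char *data;
    size_t length;
    size_t capacity;
} Buffer;

/* The mental model, written down as executable checks. */
void buffer_check(const Buffer *b) {
    assert(b != NULL);
    assert(b->data != NULL);
    assert(b->capacity > 0);
    assert(b->length <= b->capacity);
}

void buffer_init(Buffer *b, size_t capacity) {
    assert(b != NULL);
    assert(capacity > 0);
    b->data = malloc(capacity);
    if (b->data == NULL) abort(); /* real failure, not a logic bug */
    b->length = 0;
    b->capacity = capacity;
    buffer_check(b);
}

void buffer_append(Buffer *b, const char *bytes, size_t n) {
    buffer_check(b);
    assert(bytes != NULL || n == 0);
    size_t old_length = b->length;

    /* Double the capacity until the new bytes fit. */
    size_t needed = b->length + n;
    if (needed > b->capacity) {
        size_t new_capacity = b->capacity;
        while (new_capacity < needed) new_capacity *= 2;
        char *grown = realloc(b->data, new_capacity);
        if (grown == NULL) abort();
        b->data = grown;
        b->capacity = new_capacity;
    }
    if (n > 0) memcpy(b->data + b->length, bytes, n);
    b->length += n;

    /* Postconditions: exactly n bytes were added, invariants still hold. */
    assert(b->length == old_length + n);
    buffer_check(b);
}

void buffer_free(Buffer *b) {
    buffer_check(b);
    free(b->data);
    b->data = NULL;
    b->length = 0;
    b->capacity = 0;
}
```

Appending "hello" and then ", world" to a buffer initialized with capacity 4 leaves 12 bytes in it, with the capacity having doubled twice along the way; passing a buffer that was never initialized would most likely trip `buffer_check` immediately in a debug build.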


You can use asserts in an exploratory fashion to document code you don't understand too.


Not so good for code that runs in servers, since you have no chance to drop requests or degrade gracefully.


If you hit it with enough examples of input, as was done here, you can be fairly confident. But you're right: short of that, assert-heavy code is going to cause headaches in servers. C is a nice, quick language to run tests on, which makes it doable.


I'd be interested in reading about what you learned in moving from UI-heavy to algorithm-heavy engineering :)


It's a long story, and it's also not complete yet (I'm actually doing very UI heavy work right now as a tech lead). It's also not really correct to say it started with UI - I was big into programming language theory in college, even implementing a bunch of toy interpreters/compilers, one of which even got some measure of fame on the Internet.

The 5-second overview is really that it's the same as getting good at any new skill. You find an area that you don't know how to do, and then keep working at it until you do know how to do it. Then repeat with finer-grained details. There were a bunch of skills involved in this project - C, HTML5, UTF-8 decoding, debugging, testing, autotools, CTypes, API design, documentation - that I wasn't all that good at when I started and had to pick up along the way.


This is the most inspiring thing I've read in a long time. There's a lot of chatter on HN about how people became an "expert" in this or that, but for some reason, the way you phrased it really resonated with me. And to see the end result -- holy crap. HTML is complicated.



