from TFA:
The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn’t a novel idea, but it works!) The basic process works as follows:
1. Parse the HTML code and keep track of the number of bytes processed.
2. Store the text output on a per-line, or per-paragraph basis.
3. Associate with each text line the number of bytes of HTML required to describe it.
4. Compute the text density of each line as the ratio of text bytes to the HTML bytes that produced it.
5. Then decide if the line is part of the content by using a neural network.
You can get pretty good results just by checking if the line’s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning — not to mention that it’s easier to implement!
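The steps quoted above can be sketched with nothing but the standard library. This is my own minimal illustration of the density idea, not the article's actual code: the choice of break tags, the `DensityParser` name, and the byte accounting are all assumptions.

```python
from html.parser import HTMLParser

class DensityParser(HTMLParser):
    """Accumulate text per line along with the bytes of HTML markup
    consumed to produce it (steps 1-4 above)."""
    BREAKS = {"p", "br", "div", "li"}  # tags treated as line boundaries

    def __init__(self):
        super().__init__()
        self.lines = []    # (text, markup_bytes) pairs
        self._text = []
        self._markup = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAKS:
            self._flush()
        self._markup += len(self.get_starttag_text() or "")

    def handle_endtag(self, tag):
        self._markup += len(tag) + 3   # "</" + tag + ">"

    def handle_data(self, data):
        self._text.append(data)

    def _flush(self):
        text = "".join(self._text).strip()
        if text:  # markup with no text is simply dropped in this sketch
            self.lines.append((text, self._markup))
        self._text, self._markup = [], 0

    def close(self):
        super().close()
        self._flush()

def densities(html):
    """Step 4: density = text bytes / (text bytes + markup bytes)."""
    p = DensityParser()
    p.feed(html)
    p.close()
    return [(t, len(t) / (len(t) + m)) for t, m in p.lines]
```

Running this on a page, navigation chrome like `<div class="nav"><a href="/a">x</a></div>` scores near zero while a plain paragraph of prose scores close to one, which is what the fixed-threshold baseline in the excerpt exploits.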
Only Google has it shit-listed, so Firefox does too, by extension.
Opera 9.5 and IE 7 (both with phishing detection as well) don't flag it. Google's malware detection engine has been known to be wrong plenty of times before: it cost a startup I know a month's income because some bad sites were linking to them, before Google apologized and undid the blacklisting.
Maybe that's the trick... It's a meta-article about how to _really_ get text from HTML. By figuring out that you can just dump the text from lynx, you no longer have to read it!
The patterns could be formed in a more formalized way by applying Shannon's principles of information entropy across the document, lines of text, word patterns, N-Grams, etc. Then Bayesian inference can be applied for probabilistic pattern matching (vs the neural network).
Not that this approach is any easier; it's just perhaps more robust when applied to other problem sets.
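A rough sketch of that suggestion: compute Shannon entropy per line as an extra feature alongside density, then classify with a tiny Gaussian naive Bayes instead of a neural network. The feature choice, the `GaussianNB` class, and the training labels here are all my own illustration, not anything from the article or Autonomy.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy (bits/char) of a line's character distribution."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

class GaussianNB:
    """Minimal Gaussian naive Bayes over real-valued line features."""
    def fit(self, X, y):
        self.priors, self.stats = {}, {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.priors[label] = len(rows) / len(X)
            feats = []
            for col in zip(*rows):
                mu = sum(col) / len(col)
                var = max(sum((v - mu) ** 2 for v in col) / len(col), 1e-9)
                feats.append((mu, var))
            self.stats[label] = feats
        return self

    def predict(self, x):
        def log_posterior(label):
            lp = math.log(self.priors[label])
            for v, (mu, var) in zip(x, self.stats[label]):
                lp -= (v - mu) ** 2 / (2 * var) + 0.5 * math.log(2 * math.pi * var)
            return lp
        return max(self.stats, key=log_posterior)
```

Train it on a handful of hand-labeled lines with `[density, entropy]` feature vectors and `predict` gives the most probable class; the independence assumption is crude, but it is transparent and needs far less tuning than a network.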
Nothing new, of course: according to their website, Autonomy (Europe's second-largest software company) uses these techniques as the basis of its core technology for analyzing text, audio, video, etc.
It's an interesting idea and probably useful in some situations. It would be much more useful, though, if the parser kept structural information about where in the HTML tree a particular text fragment was found. Lines could still serve as the unit to which the statistical analysis is applied (although that seems error prone), but knowing more about the structure would enable further processing down the line.
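One way to keep that structural information is to track the stack of open tags while parsing and tag every text fragment with its path. A minimal sketch, again with only the standard library; the `PathParser` name, the void-tag set, and the path format are assumptions of mine.

```python
from html.parser import HTMLParser

class PathParser(HTMLParser):
    """Record each text fragment with the chain of tags enclosing it."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # never pushed

    def __init__(self):
        super().__init__()
        self.stack = []
        self.fragments = []   # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching open tag (tolerates bad nesting)
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fragments.append(("/".join(self.stack), text))
```

Fragments that share a path prefix (say, everything under `html/body/div/p`) can then be grouped before the per-line statistics are applied, which is exactly the "further processing down the line" the comment asks for.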
For those who are too scared to follow the link...