Overview of Text Extraction Algorithms

ajays · on March 19, 2011

This page is just a thin wrapper, with a link to the _actual_ overview: http://tomazkovacic.com/blog/14/extracting-article-text-from...

And here's his list of resources: http://tomazkovacic.com/blog/56/list-of-resources-article-te...

bl4k · on March 19, 2011

I wonder why anybody from HN here read that article and decided to submit it and not the original source

itsnotvalid · on March 20, 2011

The link is changed to pointing to the original article already.

For reference, this post is originally linked to http://www.readwriteweb.com/hack/2011/03/text-extraction.php

grayrest · on March 20, 2011

Readability has a LOT of hand tuned heuristics for figuring out the most likely content of the page, but the primary indicator on whether a tag with text in it is part of an article or not is the number of commas in the tag. It's my favorite thing about the algorithm because it's a dumb idea that works. The comma rule gets the extraction correct on about 70% of the web, the rest of the heuristics are mostly there to cover screwy ways people structure their articles.

pkandathil · on March 20, 2011

Can you provide more detail on the comma rule.

I do text extraction for lynk.ly. First we clean the HTML for the page. Then remove a bunch of tags we consider not useful like script or embed.

Then we looked at all the opening tags and checked if they have specific text in them like "hide", "display:none". If it does, then we skip it the tag.

Finally we get text from a tag if its above a specific threshold.

This seems to have worked best for us.

To see it in action, check it out at lynk.ly

tha-dude · on March 20, 2011

I've been dabbling in content-scraping, what bugs me is that with all the AJAX trickery that's going on, merely analyzing the XHTML source doesn't get you very far in many cases. Executing the page (JS, DOM and all) via browser-programming is an option but of course quite expensive. A headless browser is what's needed!

pkandathil · on March 21, 2011

Yeah. I think that is the challenge. A good way to get around the AJAX problem is to see if a site has an RSS feed and use that to extract content. I wish sites had a url for bots built in so you didnt have to do all this fancy stuff to extract the content.

buss · on March 22, 2011

Many of the big sites will feed you non-ajax content if you're the googlebot.

garply · on March 19, 2011

I studied a fair amount of NLP (a true passion of mine) at school and after I graduated I spent several months working on tech which did this (and other things). That was intended to be a startup, but sadly, at the time, my business sense sucked and I couldn't decide on a good product to fit the tech to (the fact that I was developing tech before I had a strong sense of my product is already telling).

I since have started a completely different (and profitable) company and the code has just been bit-rotting. I'm not sure what I should do with it. Keep it around in case I ever decide to do a business model like some of these companies (I probably don't have time for that)? Open source it (time-consuming to clean the code and what do I stand to gain from that)? I guess I could use the open-sourced stuff to help me find contracts for freelancing, but I just don't see a lot of NLP remote work being offered.

Still, I hate seeing the code rot...

bl4k · on March 19, 2011

dump it on github - someone will come around and clean it up for you

what is it written in?

garply · on March 19, 2011

Mostly Python. Some C. Some R to play around with datasets and prototype algorithms.

hsmyers · on March 19, 2011

Dump it on a hub and provide a link, or just provide a link to gz or zip version...

epi0Bauqu · on March 19, 2011

What does the tech do exactly?

garply · on March 19, 2011

The bulk of it was:

News article title extraction. News article relevant thumbnail extraction. News article text body extraction. Generating publicly traded stock symbols from business news articles. Some Techmeme-style document clustering.

buss · on March 19, 2011

I am working on a project (more of a public service than a startup) that needs this. I've looked through all of the resources linked in the articles above and nothing works as well as I need it to. The best performer is readability, so I will probably be going with the python port of that.

If your code works well I also think you should put it up on github. You can see what I intend to use this technology for by reading this text snippet: https://github.com/sbuss/revisionews/blob/develop/web/index....

petercooper · on March 19, 2011

Currently in the middle of a re-architecture/re-write due to its flaws but something similar to this in Ruby I worked on last year: http://github.com/peterc/pismo

buss · on March 20, 2011

This looks pretty great, I'll definitely keep an eye on it. Thanks :)

eli · on March 20, 2011

Which python port are you using? Last time I looked all the python Readability code I could find was either incomplete, old, or buggy.

buss · on March 20, 2011

I've experimented with https://github.com/gfxmonk/python-readability but it's extremely slow. There's a decent fork called decruft that is a couple orders of magnitude faster http://www.minvolai.com/blog/decruft-arc90s-readability-in-p...

Decruft also has a couple bug fixes to python-readability. They both need a lot of work, though. You'll have to do some spelunking to figure out how to actually call the libraries correctly.

hollerith · on March 19, 2011

When I started using the internet in 1992 Usenet (which BTW was almost always referred to as netnews or just plain "news" before Time magazine, etc, used their influence as explainers of the internet to the general public to change the name) was the social heart of the internet the way the web is now, and you did not need algorithms to extract the text from Usenet because the text was all dead-simple plain text files.

rb2k_ · on March 19, 2011

I always wanted to generate a simple service that classifies websites. Something where you dump in the HTML/URL and it returns something like "agriculture"/"government"/"retail"/"education".

I already have a set of a few thousand classifications at hand. What would probably be a good algorithm to run it through? I assume I'd use something like webstemmer/boilerpipe/... to extract just the main text first.

What I am a bit uncertain is what I should do after that. My guess would be that I isolate the nouns/adjectives with the highest frequency and do a clustering with my already categorized dataset as training data.

Does that somewhat makes sense? If yes: any recommendations or alternatives for libraries (preferably ruby) or just algorithms themselves (k-means, svm, neural network...)

flavy · on March 20, 2011

Hi,we are launching a service exactly like the one you were referring to. You can try the API already. Check out the docs https://sites.google.com/site/thinkersrus/products-1/science... and the demo http://www.thinkersr.us/demo.html

rb2k_ · on March 20, 2011

Oh nice, looks interesting and worked pretty well from my few tests. Sadly, with the amount of requests I'd be doing (10000+ just for training), I guess I should build myself something on my own

hooande · on March 20, 2011

For simple text classification, you should just use something like Wekka, rainbow or the Google Prediction API. The hardest part is always labeling and verifying your category training data. There are many open source algorithms that can do the classification and any form of naive bayes will probably be good enough.

buss · on March 19, 2011

I think you'd want something like LDA. See https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation....

rb2k_ · on March 21, 2011

Thanks for the recommendation! Even found a ruby wrapper arround the C lib: https://github.com/ealdent/lda-ruby

PaulHoule · on March 19, 2011

Unless I'm missing something, all the methods he's talking about involve looking at web pages in isolation, or, alternatively across the set of all web pages.

To do "template drop out", it would seem productive to look longitudinally across pages on a single site, or in a subdirectory. For instance, almost all pages in Hacker News have the same chrome. Methods used for DNA clustering (such as Hidden Markov Models) could quickly find 'conserved' and 'unconserved' areas of documents.

This touches semantic technology because it links the ability to find nameless statistical patterns with meaningful semantic identifiers, such as domain names.

andraz · on March 19, 2011

Methods that do clustering on similar web pages are mostly too CPU intensive for processing larger sets (we're talking millions of web pages). They are also harder to scale from data-locality perspective, you need to figure out which pages to put together and then get the data together.

Looking at pages in isolation is much more horizontally scalable. You can take a look at Webstemmer (http://www.unixuser.org/~euske/python/webstemmer/index.html) for a method exploiting similarities.

PaulHoule · on March 20, 2011

Great reply. However, I think something is only worth doing if it's impossible.

Your argument is like Chomsky's argument about the poverty of the stimulus, just in reverse. There are heuristics that let us radically prune the N^2 possible relationships between things into a much smaller set that will let us do things that would be otherwise unscalable.

andraz · on March 20, 2011

Let me know if you know of this approach being used somewhere in production processing millions of web pages. I would be very interested to know how they overcome the difficulties!

I can imagine the cost/benefit of the approach is favorable for largest search engines like Google and Bing that are trying to squeeze last few percentage points of precision out of results. For everybody else, the engineering and scaling difficulties are probably too big. I'd love to be proven wrong.

PaulHoule · on March 20, 2011

Google and Bing are doing billions of web pages, not millions. I process millions of web pages myself with 3 computers -- millions aren't a lot these days, although I'm not currently using clustering methods.

Rather I'm selective with my inputs so I start with unscrambled eggs so I can improve precision not by "a few percentage points" but rather reduce the false positive rate by an order of magnitude.

My use of ML so far has been modest, limited to solving a few straightforward problems. Personally I think search is boring (on webscale, too big of a game for small players plus search as we know it probably can't get much better because the queries are not precise -- better performance will require changing the game) but I've been forced to put effort into it because end users expect it.

TorKlingberg · on March 20, 2011

I have been thinking there should be a way in html to mark the main part of a web page, as opposed to the header, navigation, footer, etc. It could be used when printing, by screen readers, search engines, Readability. I don't know if the W3C would approve such a tag, or if site owners would bother to use it though.

semanticist · on March 20, 2011

HTML 5 has the 'article' tag. There's also header, footer, and other semantic mark-up tags.

I've spent the last month writing scrapers for newspaper sites. No one uses any of these things.

buss · on March 21, 2011

Are the scrapers hand-tuned to each website? If not how did you do it?

semanticist · on March 21, 2011

We need to extract site-specific metadata, and I'm no expert at NLP, so it's mostly one scraper per site.