News article title extraction. News article relevant thumbnail extraction. News article text body extraction. Generating publicly traded stock symbols from business news articles. Some Techmeme-style document clustering.
I am working on a project (more of a public service than a startup) that needs this. I've looked through all of the resources linked in the articles above and nothing works as well as I need it to. The best performer is readability, so I will probably be going with the python port of that.
Currently in the middle of a re-architecture/re-write due to its flaws but something similar to this in Ruby I worked on last year: http://github.com/peterc/pismo
Decruft also has a couple bug fixes to python-readability. They both need a lot of work, though. You'll have to do some spelunking to figure out how to actually call the libraries correctly.
News article title extraction. News article relevant thumbnail extraction. News article text body extraction. Generating publicly traded stock symbols from business news articles. Some Techmeme-style document clustering.