
The bulk of it was:

- News article title extraction.
- News article relevant thumbnail extraction.
- News article text body extraction.
- Generating publicly traded stock symbols from business news articles.
- Some Techmeme-style document clustering.
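As a rough illustration of the first item, here is a minimal title-extraction sketch using only the Python standard library. It is not the code described above. The `TitleParser` and `extract_title` names are hypothetical, and the choices to prefer an `og:title` meta tag and to cut the `<title>` text at " | " or " - " are assumptions about how news sites typically mark up headlines.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the og:title meta value and the <title> text, if present."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return a best-guess headline, or None if nothing usable is found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    if parser.og_title and parser.og_title.strip():
        return parser.og_title.strip()
    raw = "".join(parser.title_parts).strip()
    # Assumption: the site name is appended after " | " or " - ".
    for sep in (" | ", " - "):
        if sep in raw:
            raw = raw.split(sep)[0]
    return raw.strip() or None
```

Real pages break these assumptions often (headlines that contain " - " themselves, missing or wrong meta tags), so treat this as a starting point rather than something that matches a tuned extractor.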




I am working on a project (more of a public service than a startup) that needs this. I've looked through all of the resources linked in the articles above, and nothing works as well as I need it to. The best performer is Readability, so I will probably go with the Python port of that.

If your code works well, I also think you should put it up on GitHub. You can see what I intend to use this technology for by reading this text snippet: https://github.com/sbuss/revisionews/blob/develop/web/index....


It's currently in the middle of a re-architecture and rewrite because of its flaws, but I worked on something similar to this in Ruby last year: http://github.com/peterc/pismo


This looks pretty great, I'll definitely keep an eye on it. Thanks :)


Which Python port are you using? Last time I looked, all the Python Readability code I could find was either incomplete, old, or buggy.


I've experimented with https://github.com/gfxmonk/python-readability, but it's extremely slow. There's a decent fork called decruft that is a couple of orders of magnitude faster: http://www.minvolai.com/blog/decruft-arc90s-readability-in-p...

Decruft also has a couple of bug fixes to python-readability. They both need a lot of work, though. You'll have to do some spelunking to figure out how to actually call the libraries correctly.
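For anyone who wants a feel for the general idea before digging into those libraries, here is a deliberately crude body-extraction sketch using only the standard library. It is not Readability's or decruft's algorithm, and it does not call either library. It just picks the container element that directly holds the most `<p>` text. The `ParagraphCollector` and `extract_body` names and the set of container tags are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: article text usually lives in <p> tags inside one of these.
CONTAINERS = {"div", "article", "section", "td", "body"}


class ParagraphCollector(HTMLParser):
    """Groups <p> text by the innermost open container element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # indexes into self.blocks for open containers
        self.blocks = []   # one list of paragraph strings per container seen
        self._p = None     # text parts of the <p> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.flush()
            self.blocks.append([])
            self.stack.append(len(self.blocks) - 1)
        elif tag == "p":
            self.flush()   # a new <p> implicitly closes the previous one
            self._p = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
        elif tag in CONTAINERS and self.stack:
            self.flush()
            self.stack.pop()

    def handle_data(self, data):
        if self._p is not None:
            self._p.append(data)

    def flush(self):
        """Store the current paragraph; text outside any container is dropped."""
        if self._p is not None and self.stack:
            text = " ".join("".join(self._p).split())
            if text:
                self.blocks[self.stack[-1]].append(text)
        self._p = None


def extract_body(html):
    """Return the paragraphs of the container with the most <p> text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    best = max(parser.blocks, key=lambda ps: sum(len(p) for p in ps), default=[])
    return "\n\n".join(best)
```

This ignores everything that makes the real problem hard (link density, comment sections, class-name hints, badly nested markup), which is where the ports mentioned above spend most of their effort.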



