I tested a lot of these services and libraries a while ago as part of developing a product that required extracting article text and metadata from a URL.
The best by some margin was Diffbot (www.diffbot.com). I compared roughly 20 different services and libraries, and it came out well ahead. It uses machine learning rather than regular expressions or per-site filters, and the engine has been extensively trained (I threw a lot of edge cases at it, which improved it). There seem to be a lot of similar services that do well on common cases but fall apart completely when applied broadly.
So to the author of this service - what features or examples do you have that distinguish your implementation from others? What is the technique being used here?
I set out to build my own interpretation, and that's what I did. It has an automatic extraction mode, but it also uses per-site rules when they are available (there's a rough sketch of what I mean below).
I'd say the main distinguishing factor is the price point: free.
I cannot, in good conscience, charge for just scraped content. I've mentioned it before: once I complete testing, I'll release the full source on GitHub.
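For anyone curious what "per-site rules with an automatic fallback" can look like in practice, here is a minimal PHP sketch. It is only illustrative, not the actual implementation; the rule table, function name, and paragraph-density heuristic are assumptions:

    <?php
    // Try a per-site XPath rule for the host first; fall back to a
    // generic heuristic if no rule exists or the rule matches nothing.
    function extract_article($url, $html) {
        $host = parse_url($url, PHP_URL_HOST);

        // Hypothetical per-site rules: host => XPath of the article body.
        $rules = array(
            'example.com' => '//div[@id="article-body"]',
        );

        $doc = new DOMDocument();
        @$doc->loadHTML($html);          // suppress warnings on messy markup
        $xpath = new DOMXPath($doc);

        if (isset($rules[$host])) {
            $nodes = $xpath->query($rules[$host]);
            if ($nodes !== false && $nodes->length > 0) {
                return $doc->saveHTML($nodes->item(0));
            }
        }

        // Automatic fallback: pick the container with the most <p> tags
        // (a very rough stand-in for real content scoring).
        $best = null;
        $bestCount = 0;
        foreach ($xpath->query('//article|//div') as $node) {
            $count = $xpath->query('.//p', $node)->length;
            if ($count > $bestCount) {
                $bestCount = $count;
                $best = $node;
            }
        }
        return $best ? $doc->saveHTML($best) : '';
    }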
I used to work for an online publisher in Spain that did exactly that: charge for scraped content. You wouldn't believe how many customers there are for this kind of thing.
Then figure out why developers would use your service and charge them for making it easier to achieve that goal. I'm sure that scraping content is just a means to an end.
Looks nice, we should talk. I run a service that does the same (and more): http://www.feedsapi.com. Where in Switzerland are you based? I was in Bienne a couple of months ago; I'm based in Germany. I'll drop you a mail shortly.
I'm using it in http://readapp.net and my upcoming HN News app, so it isn't scheduled to disappear any time soon. Send me an email if you'd like to discuss this further.
This is great. I made a personal periodical for myself using readability and it worked, but was a pain in the ass. This is exactly what I should've built first.
Warning: file_put_contents(db/aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==.clr) [function.file-put-contents]: failed to open stream: No such file or directory in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 27
Warning: file_get_contents(db/aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==.clr) [function.file-get-contents]: failed to open stream: No such file or directory in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 31
Warning: Cannot modify header information - headers already sent by (output started at /home/mackh_vps/api.thequeue.org/v1/clear.php:27) in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 55
aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==Invalid URL
Also, maybe turn off display_errors and turn on log_errors in your php.ini.
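For reference, the relevant php.ini directives would look something like this (the log path is just an example):

    ; hide errors from HTTP responses, keep them in a log file instead
    display_errors = Off
    log_errors = On
    error_log = /var/log/php-errors.log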
I think they want content negotiation: if the request sends an Accept header asking for JSON, you reply with JSON, instead of specifying the format with a query parameter.
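Something along these lines, assuming plain PHP (the variable names are made up, not the service's real code):

    <?php
    // Reply with JSON only when the client asks for it via the Accept header.
    $accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';

    if (strpos($accept, 'application/json') !== false) {
        header('Content-Type: application/json');
        echo json_encode($result);        // $result: the extracted article data
    } else {
        header('Content-Type: text/html; charset=utf-8');
        echo $result['content'];          // default: raw HTML of the article
    }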
Shameless plug: http://www.feedsapi.com supports JSON. You might want to check it out, and drop me a mail if you have any special use case or questions.