I tested a lot of these services and libraries a while ago as part of developing a product that required extracting article text and metadata from a URL.
The best by some margin was Diffbot (www.diffbot.com). I compared roughly 20 different services and libraries, and it came out well ahead. It uses machine learning rather than regular expressions or per-site filters, and the engine has been extensively trained (I threw a lot of edge cases at it, which improved it). There seem to be a lot of similar services that do well on common cases but fall apart completely when applied broadly.
So to the author of this service - what features or examples do you have that distinguish your implementation from others? What is the technique being used here?
I set out to build my own interpretation, and that's what I did. It has an automatic extraction mode, but it also uses per-site rules when they are available (there's a rough sketch of what I mean below).
I'd say the main distinguishing factor is the price point: free.
I cannot, in good conscience, charge for just scraped content. I've mentioned it before: once I complete testing, I'll release the full source on GitHub.
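For anyone curious what "per-site rules with an automatic fallback" can look like in practice, here is a minimal PHP sketch. It is only illustrative, not the actual implementation; the rule table, function name, and paragraph-density heuristic are assumptions:

    <?php
    // Try a per-site XPath rule for the host first; fall back to a
    // generic heuristic if no rule exists or the rule matches nothing.
    function extract_article($url, $html) {
        $host = parse_url($url, PHP_URL_HOST);

        // Hypothetical per-site rules: host => XPath of the article body.
        $rules = array(
            'example.com' => '//div[@id="article-body"]',
        );

        $doc = new DOMDocument();
        @$doc->loadHTML($html);          // suppress warnings on messy markup
        $xpath = new DOMXPath($doc);

        if (isset($rules[$host])) {
            $nodes = $xpath->query($rules[$host]);
            if ($nodes !== false && $nodes->length > 0) {
                return $doc->saveHTML($nodes->item(0));
            }
        }

        // Automatic fallback: pick the container with the most <p> tags
        // (a very rough stand-in for real content scoring).
        $best = null;
        $bestCount = 0;
        foreach ($xpath->query('//article|//div') as $node) {
            $count = $xpath->query('.//p', $node)->length;
            if ($count > $bestCount) {
                $bestCount = $count;
                $best = $node;
            }
        }
        return $best ? $doc->saveHTML($best) : '';
    }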
I used to work for an online publisher in Spain that did exactly that: charge for scraped content. You wouldn't believe how many customers there are for this kind of thing.
Then figure out why developers would use your service and charge them for making it easier to achieve that goal. I'm sure that scraping content is just a means to an end.
Looks nice, we should talk. I run a service that does the same (and more): http://www.feedsapi.com. Where in Switzerland are you based? I was in Bienne a couple of months ago; I'm based in Germany. I'll drop you a mail shortly.
I'm using it in http://readapp.net and my upcoming HN News app, so it isn't scheduled to disappear any time soon. Send me an email if you'd like to discuss this further.
This is great. I made a personal periodical for myself using readability and it worked, but was a pain in the ass. This is exactly what I should've built first.
Warning: file_put_contents(db/aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==.clr) [function.file-put-contents]: failed to open stream: No such file or directory in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 27
Warning: file_get_contents(db/aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==.clr) [function.file-get-contents]: failed to open stream: No such file or directory in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 31
Warning: Cannot modify header information - headers already sent by (output started at /home/mackh_vps/api.thequeue.org/v1/clear.php:27) in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 55
aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==Invalid URL
Also, maybe turn off display_errors and turn on log_errors in your php.ini.
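For reference, the relevant php.ini directives would look something like this (the log path is just an example):

    ; hide errors from HTTP responses, keep them in a log file instead
    display_errors = Off
    log_errors = On
    error_log = /var/log/php-errors.log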
I think they want content negotiation: if the request sends an Accept header asking for JSON, you reply with JSON, instead of specifying the format with a query parameter.
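Something along these lines, assuming plain PHP (the variable names are made up, not the service's real code):

    <?php
    // Reply with JSON only when the client asks for it via the Accept header.
    $accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';

    if (strpos($accept, 'application/json') !== false) {
        header('Content-Type: application/json');
        echo json_encode($result);        // $result: the extracted article data
    } else {
        header('Content-Type: text/html; charset=utf-8');
        echo $result['content'];          // default: raw HTML of the article
    }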
Shameless plug: http://www.feedsapi.com supports JSON. You might want to check it out, and drop me a mail if you have any special use case or questions.