Like Instapaper, but for Developers (thequeue.org)
161 points by mmackh on Feb 29, 2012 | 48 comments



I tested a lot of these services and libraries a while ago as part of developing a product that required extracting article text and metadata from a URL.

The best service by some margin was Diffbot (www.diffbot.com), based on comparisons I ran across approximately 20 different services and libraries. It uses machine learning rather than regular expressions or per-site filters, and the engine has been extensively trained (I threw a lot of edge cases at it, which improved it). There seem to be a lot of similar services that do well with common cases but completely fall apart when applied broadly.

So to the author of this service - what features or examples do you have that distinguish your implementation from others? What is the technique being used here?


I set out to build my own interpretation and that's what I did. It does have an automatic extraction pattern, but also uses per-site rules, if available. I'd say the main distinguishing factor is the price point: free.
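As a rough illustration of "per-site rules if available, automatic extraction otherwise", here is a minimal Python sketch of the dispatch step only. It is not the service's actual code (which is built on PHP); the `SITE_RULES` table, the `example.com` entry, the element id `article-body`, and the function name `choose_strategy` are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical rule table: host -> id of the element assumed to hold the article.
SITE_RULES = {
    "example.com": "article-body",
}


def choose_strategy(url):
    """Return ("site_rule", element_id) if a rule exists, else ("automatic", None)."""
    host = (urlparse(url).hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same site.
    if host.startswith("www."):
        host = host[4:]
    rule = SITE_RULES.get(host)
    if rule is not None:
        return ("site_rule", rule)
    return ("automatic", None)
```

With this layout, fixing a badly extracted page means adding one entry to the table rather than touching the generic extractor.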


Which, if you want people to use it with confidence, probably isn't a selling point:

http://blog.pinboard.in/2011/12/don_t_be_a_free_user/


I cannot, in good conscience, charge for just scraped content. I mentioned it before, but once I complete testing, I'll release the full source on GitHub


I used to work for an online publisher in Spain, which did exactly that: charge for scraped content. You can't imagine how many customers there are for this kind of thing.


Then figure out why developers would use your service and charge them for making it easier to achieve that goal. I'm sure that scraping content is just a means to an end.


Fair enough, but I'm a whole lot less likely to integrate an API that could break or disappear with no recourse.


Would you care to write a blog post about your comparison?

Or, at least, list the 20 services + libs you used?


Looks nice, we should talk. I run a service that does the same (and more): http://www.feedsapi.com. Where in Switzerland are you based? I was in Bienne a couple of months ago, and I'm based in Germany. I will drop you a mail shortly.


Any chance you can expand this as a "real" service, i.e. one with a guaranteed service level for a monthly fee?

I would love to use this in an iPhone app I am building, but I am obviously wary as it may disappear/go offline at any point.

I would gladly pay a monthly subscription to use it.


I'm using it in http://readapp.net & my upcoming HN News App, so it isn't scheduled to disappear any time soon. Send me an email, so we can discuss this further, if you'd like to.


Interesting service. You've got several typos and some awkward phrasing in the text under http://readapp.net/pub.html though.


Thanks, will rewrite this today


thanks, will send you an email.


What text extractor engine are you using?


I'm building this on top of a PHP port of readability.



Really nice. Could you explain more about this engine?


I tried this on https://www.xkcd.com/386/ , but http://api.thequeue.org/v1/clear?url=https://www.xkcd.com/38... just extracted the content disclaimer and Creative Commons license notice at the bottom of the page: "Warning: this comic contains [...] This work is licensed under [...]".


Should be all fixed now, please let me know if it works for you.


Somewhat better now, but it still grabs a lot of the boilerplate too.

(Also, out of curiosity, what did you change to fix it?)


I added a set of rules for this page. Unfortunately the DOM isn't easily broken down, thus the extra clutter.


Sweet. We should talk. I run a similar project at http://www.rtcool.com/


Thanks. You should avoid underlined text for non-links, though.


It would be really great if, for shortened links, it also provided the final url


Try now
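For anyone wanting to do the same on the client side, resolving a shortened link comes down to following the redirects and reading the URL you end up at. This is a minimal Python sketch, not how the service does it; the function name `resolve_final_url` and the injectable `opener` parameter (there so the function can be exercised without network access) are my own assumptions.

```python
import urllib.request


def resolve_final_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return the URL reached after following any HTTP redirects."""
    # urlopen follows redirects by default; geturl() reports the final location.
    with opener(url, timeout=timeout) as response:
        return response.geturl()
```

Calling `resolve_final_url` with a short link performs a real request, so a timeout and error handling around it are advisable in production use.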


Great stuff, thank you


This is great. I made a personal periodical for myself using readability and it worked, but was a pain in the ass. This is exactly what I should've built first.


Thanks a lot. It needs to improve a bit I guess but a great beginning, always wanted such an API.


Just what I've been looking for! Will definitely use it sooner or later.


You might want to stop it from opening local files.


Could you please elaborate?


Probably they're talking about the implication of errors such as in the result of http://api.thequeue.org/v1/clear?url=http://news.ycombinator...:

    Warning: file_put_contents(db/aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==.clr) [function.file-put-contents]: failed to open stream: No such file or directory in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 27

    Warning: file_get_contents(db/aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==.clr) [function.file-get-contents]: failed to open stream: No such file or directory in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 31

    Warning: Cannot modify header information - headers already sent by (output started at /home/mackh_vps/api.thequeue.org/v1/clear.php:27) in /home/mackh_vps/api.thequeue.org/v1/clear.php on line 55
    aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw==Invalid URL
Also, maybe turn off display_errors and turn on log_errors in your php.ini.


Thanks, the ? character was causing issues
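The file name in the warnings above looks like standard base64 of the requested URL, and in this case the `?` ends up producing a `/` in the encoded name, which the filesystem reads as a directory separator. One possible fix, sketched here in Python rather than the service's PHP, is to use the URL-safe base64 alphabet for cache file names. The helper name `cache_name` is hypothetical; the encoded string is copied from the log.

```python
import base64

# File name copied from the warning above, without the ".clr" suffix.
LOGGED_NAME = "aHR0cDovL25ld3MueWNvbWJpbmF0b3IuY29tL2l0ZW0/aWQ9MzY0NjYyNw=="


def cache_name(url: str) -> str:
    """Encode a URL into a name usable as a single path component."""
    # The URL-safe alphabet substitutes "-" for "+" and "_" for "/".
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


# Recover the URL from the logged name, then compare both encodings.
url = base64.b64decode(LOGGED_NAME).decode("utf-8")
standard = base64.b64encode(url.encode("utf-8")).decode("ascii")
safe = cache_name(url)

print("/" in standard)  # the standard alphabet yields a path separator here
print("/" in safe)      # the URL-safe alphabet does not
```

Very long URLs can still exceed the file name length limit of many filesystems, so hashing the URL (for example with SHA-256) is another option if the name does not need to be reversible.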


Thanks man, you quite likely made my day!


any chance you'll add JSON support?


Try adding &format=json and let me know if you run into any bugs


How's about sum Accept headers up in there?


Could you point me in the right direction?


Accept: application/json


I'm currently using header('Content-type: application/json'); Source: http://stackoverflow.com/questions/267546/correct-http-heade...


I think they want it so that if they send an Accept header in the request that asks for json, you reply with json, instead of using a query parameter to specify the format.

https://developer.mozilla.org/en/HTTP/Content_negotiation has more details about the accept header and its use.
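To make the idea concrete, here is a simplified Python sketch of choosing the response format from the request's Accept header while still honoring the existing `format` query parameter. It is an illustration under assumptions, not the service's code: the function name `pick_format`, the supported media types, the XML default, and the choice to let the query parameter win are all mine, and wildcards such as `*/*` are deliberately not matched.

```python
# Assumed mapping from media types to the formats the API can produce.
SUPPORTED = {
    "application/json": "json",
    "application/xml": "xml",
    "text/xml": "xml",
}


def pick_format(accept, query_format=None, default="xml"):
    """Pick "json" or "xml" from an Accept header, with a query override."""
    # An explicit &format=... parameter takes precedence (assumed design choice).
    if query_format in ("json", "xml"):
        return query_format
    candidates = []
    for part in (accept or "").split(","):
        fields = [f.strip() for f in part.split(";")]
        media = fields[0].lower()
        quality = 1.0  # a missing q parameter means full preference
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        if media in SUPPORTED and quality > 0:
            candidates.append((quality, SUPPORTED[media]))
    if not candidates:
        return default
    # Highest q wins; on ties, max() keeps the first one listed.
    return max(candidates, key=lambda c: c[0])[1]
```

A client sending `Accept: application/json` would then get JSON without needing `&format=json`, and the server can set the matching Content-Type header on the response.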


Shameless plug: http://www.feedsapi.com supports JSON. You might want to check it out, and drop me a mail if you have any special use case or question.


too awesome! I like it and it couldn't have come at a better time. Made my day as well :)


Very cool. I could totally use it in a future iPhone app. Pinterest fever alert! :)


Hey thanks for using my post as an example! ;)


I'm glad there is JSON support. Does anyone else think XML should die a painful death?


Very cool, good work :D



