Show HN: Readability-like API Using Machine Learning (diffbot.com)
151 points by miket on March 10, 2011 | hide | past | favorite | 49 comments



Diffbot's stuff is in a different league (it's a hosted service with a large dataset), but if anyone's vaguely interested in this area, I've been working on a Ruby library that offers some similar features: https://github.com/peterc/pismo

It is currently undergoing the "big rewrite" (which includes some proper classification work rather than shooting in the dark), but it's still in daily use on several sites. Hopefully I can learn a few lessons from Diffbot!

I should also point out boilerpipe - http://code.google.com/p/boilerpipe/ - an interesting Java-based content extraction project that's being worked on by an actual PhD student rather than a dilettante like me ;-) Again, Diffbot's stuff goes a lot further than this, but there are lessons to be learned nonetheless.

Last but not least, a paper by the aforementioned PhD student, "Boilerplate Detection Using Shallow Text Features," is available at http://www.l3s.de/~kohlschuetter/boilerplate/

I suspect that there's going to be a lot more work in these areas in the medium term, both because of the growth of the "e-discovery" market and because the dreams of a consistently marked up "semantic Web" have been going down the pan for a while now.


That is good stuff. How is http://coder.io using these?


Are any examples or an online demo available for pismo?


There is an online demo of boilerpipe: http://boilerpipe-web.appspot.com/

I use boilerpipe a lot and highly recommend it. We've discussed it before on MetaOptimize: http://metaoptimize.com/qa/questions/3440/text-extraction-fr...

I ran a few qualitative tests on Diffbot's "Article API" and the results also look good. I haven't gotten a chance to run a detailed or quantitative comparison.


Nothing specific because it's really just a library, though I'm now considering this for the next version :-) However, http://coder.io/ leans heavily on it and all of the summaries and titles there come from it.


Cool, works better than other services like this I've found. I tried it on Ars Technica's review of the Xoom tablet and it found all 10 pages. It didn't find the embedded video though. Also, all the formatting is stripped which makes it hard to differentiate section headers from content paragraphs, and all the images are in one list to the side, removed from their original context.

What I'd really love to see is a combination of the RSS API and the article API to produce full article RSS feeds for any site.


Right now, the API just returns the raw text for simplicity's sake, but it would be possible to add an option for returning a bit of HTML structure, which would address the problem of sections, inline images, tables, etc.

The combination of the two APIs is a great idea.
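As a sketch of what such a "simplified HTML" option might look like, here's a minimal whitelist filter; the set of tags kept and the output format are my assumptions, not anything Diffbot has specified:

```python
from html.parser import HTMLParser

# Structural tags worth preserving; everything else is flattened away.
KEEP = {"h1", "h2", "h3", "p", "img", "table", "tr", "td", "ul", "ol", "li"}

class Simplifier(HTMLParser):
    """Emit a stripped-down copy of the input, keeping only KEEP tags."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            if tag == "img":
                # Keep only the src attribute so images stay in context.
                self.out.append('<img src="%s">' % dict(attrs).get("src", ""))
            else:
                self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in KEEP and tag != "img":
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if data.strip():
            self.out.append(data.strip())

def simplify(html):
    s = Simplifier()
    s.feed(html)
    return "".join(s.out)
```

This keeps headers, paragraphs, and images distinguishable (addressing the section-header and inline-image complaints above) while still dropping presentational markup.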


It would be great if you could return a normalized and simplified version of the HTML structure. I know a lot of people who would be interested in this.


Yes, textile please!


I really like this.

You guys should really open up a tagging API. As a developer working on a social site, I'd love to be able to auto-tag content that users upload.


Great suggestion, you can use the article API already to do this (providing the URL of the post and the 'tag' parameter), but maybe you are thinking of analyzing other types of content? Or perhaps the ability to POST your own text data instead of just the URL?
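For what it's worth, a call like that could be sketched as follows; the endpoint URL and parameter names here are placeholders for illustration, not Diffbot's documented interface:

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint; not Diffbot's real URL.
API_URL = "https://api.example.com/article"

def build_query(token, page_url):
    """Assemble the query string (parameter names are assumptions)."""
    return urllib.parse.urlencode({
        "token": token,   # API token from signup
        "url": page_url,  # the post's public URL
        "tags": "true",   # ask for tag output
    })

def fetch_tags(token, page_url):
    """Analyze a page and return its suggested tags."""
    url = "%s?%s" % (API_URL, build_query(token, page_url))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp).get("tags", [])
```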


Exactly. For instance, if a user is typing in a blog post, right now the flow would be:

1) User submits post

2) I create a page for their post

3) I call your API to analyze the page

4) I update the page with the auto-tags

5) I redirect the user to the post

This is kind of slow. I could do it with AJAX calls too, but it's still an awkward flow. A better flow would be:

1) User submits post.

2) I pass the text of the post to your service.

3) You analyze the text and send me back the tags.

4) I add the tags into the post and create the URL.

5) I redirect the user to the URL.

This is a much more natural flow.
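The improved flow boils down to a simple reordering, sketched below; `tag_service` and `create_page` are hypothetical stand-ins for the text-analysis endpoint and the site's own page creation, not a real API:

```python
def publish_post(text, tag_service, create_page):
    """Tag-before-publish flow: analyze raw text first, then create the page."""
    tags = tag_service(text)       # steps 2-3: POST the text, get tags back
    url = create_page(text, tags)  # step 4: page is created with tags attached
    return url                     # step 5: redirect the user here
```

Compared with the first flow, the page never exists in an untagged state and there's no second round trip to update it.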

There are other solutions in this space, like Zemanta, but they generally suck. If I enter a term like "social network" into Zemanta it will tell me Google Buzz is a tag... which is ridiculous.


Funny thing, we have a start-up in submission to YC that should help with that. We hope to release a demo to HN in a couple of days.


Awesome... can I have a beta invite? :)


Hi, here is a link detailing what you can try out right now. The official beta launch is going to be later this week, but you can get started using it anytime you like: https://sites.google.com/site/thinkersrus/products-1/science...

Enjoy! :)


It looks like something I would integrate into my current project, but your terms suggest that the API is only for personal and non-commercial use.


Not at all; feel free to use it commercially. I'm removing that from the terms.


This looks great. I'd love to find out more about your API and what type of web scraping techniques you're using. It looks like this is going to be available publicly to developers? What type of usage do you guys allow?


I'd be happy to talk about the visual-based statistical classification technology in a follow-up blog post if there's interest.


There is(!)

As someone who's right in the middle of Mining the Social Web, this almost seems too coincidental to be true.


I'm an interested party as well. As a student of machine learning I would love to learn the techniques you've applied. I'm doing a data mining startup and this would be very helpful.


Visual classification sounds interesting. So you render the page and use location-based features to extract content?


In short, yes. The key innovation is that we've come up with a lossy, fixed-length representation of the visual features that we can use to do classification upon. I'll try to do a more detailed writeup on our blog when I find some time.
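For anyone curious, one generic way to turn a variable number of rendered elements into a lossy, fixed-length vector is feature hashing: hash each element's visual attributes into a fixed number of buckets. This is only an illustration of the general idea, not Diffbot's actual (unpublished) representation:

```python
import hashlib

def visual_feature_vector(elements, dims=64):
    """Hash each element's visual attributes into a fixed-size vector.

    elements: list of dicts such as
        {"tag": "p", "x": 10, "y": 200, "width": 600, "font_size": 14}
    """
    vec = [0.0] * dims
    for el in elements:
        for key, value in el.items():
            # Stable hash of the attribute name/value pair picks a bucket.
            digest = hashlib.md5(("%s=%s" % (key, value)).encode()).hexdigest()
            vec[int(digest, 16) % dims] += 1.0
    return vec
```

The result is lossy (distinct attributes can collide in a bucket) but fixed-length, so any off-the-shelf classifier can consume it regardless of how many elements the page has.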

btw, our blog has no rss feed, but you can just use our RSS API :-) http://www.diffbot.com/api/rss/http:/www.diffbot.com/blog


+1 for good computer science writeups


Article content starts out in a straightforward, easy-to-process form, as created by the reporter/author, in a content management system. Then the CMS chops it up into pages and adds boilerplate for presentation as a web page. Then you expend lots of effort to stick the pages back together and filter the crap back out, to arrive at an approximation of the original. Generally a noisy, imperfect approximation that is less useful for your purposes (indexing, information extraction, etc).

If technical considerations were the only considerations, we would find a way to get at the content directly instead of using this Rube Goldberg mechanism. But of course there are also economic considerations. Content owners don't want to give you unadulterated content for free; their business model requires that ads be served along with it.

Will an arms race develop between scrapers and publishers, similar to the arms race between spammers and spam filters? Will publishers start randomizing their HTML generation, or otherwise making it difficult to separate content from peripheral material?


I like the "machine learning" part of the API, but there seems to be no way to improve the learning by giving feedback.

Testing another article from HN[1], the tags are pretty far off. I expected iPad 2 and photos/pixels but got 4G and manufacturing instead[2]. So I'm really interested in how the system came up with the right and wrong tags (which I'd guess is harder than finding the body of the article, since people are making that easier for Facebook and others through Open Graph/RDFa/hNews etc.)

[1]: http://daringfireball.net/2011/03/bending_over_backwards

[2]: tags received: Recyclable materials, Battery, 4G, Apple Inc., Rechargeable battery, Walter Mossberg, Technology, Computing, Manufacturing, Technology_Internet


That's a really nifty API. Performance is nice too. Would be interested in knowing more about it.


This is fantastic! How resource-intensive is it to run? Machine learning, depending on the implementation, can be pretty costly from what I understand.


It's fairly CPU intensive. Like many classification-based techniques, much of the computation is in constructing the features. For our case, this means we had to implement most of CSS to get the visual features of every element on the page.


Can you quantify that? Like, on a reasonable computer, how long does it take to process the Huffington Post example article? And (though this may be a little trickier to benchmark) how many such articles can you process per second?

Is this something you can do fast enough that the delays wouldn't be user-perceptible? Or maybe something that would be feasible to do in batch mode for a bunch of articles?

(In any case, it's amazing. Congratulations on making this.)


I just timed that Huffington Post example on my computer and it took around 80ms to extract at 100% CPU. The rub though is that, if you type in a totally new URL on our demo page, most of the time is actually spent in downloading the page from the original source and its associated CSS files (which we use to do visual rendering).


In an earlier comment, you said that much of the time was spent in constructing the features (e.g. you had to implement CSS). Did you mean implementation time, or training/classification time? This latest comment makes it sound like most of the time is in downloading the page, while the feature extraction is relatively fast.

In any case, if the feature extraction is taking too much time, a common trick is to dynamically select which features to extract for a test example, based both on expected predictive value (e.g. via mutual information or some other feature-selection method) and on the time it takes to actually compute the feature, measured by, say, average computation time per feature on the training set. This can speed things up a fair bit, since you only bother computing the features you really need and are biased towards the ones that are quick to compute. This may not translate to your particular application, though. If I remember correctly, I've seen it used a while back for image spam classification.
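A minimal sketch of that cost-aware selection, with illustrative values rather than real measurements:

```python
def select_features(features, budget_ms):
    """Greedily pick features by value-per-millisecond until the time budget is spent.

    features: list of (name, estimated_value, avg_cost_ms) tuples, where
    estimated_value might come from mutual information on a training set
    and avg_cost_ms from timing extraction on the same set.
    """
    ranked = sorted(features, key=lambda f: f[1] / f[2], reverse=True)
    chosen, spent = [], 0.0
    for name, value, cost in ranked:
        if spent + cost <= budget_ms:
            chosen.append(name)
            spent += cost
    return chosen
```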


Feature selection is an option, but not if all features require a certain preprocessing step.

My guess is that they need to render the page so they can determine the visual layout. So regardless of which visual features they use, the rendering step cannot necessarily be avoided.


Would you expose different portions of your pipeline?

For example, some of us might be interested in the CSS parsing and feature extraction, but in an alternate machine learning technique.

If you are going to assume tech savvy users, then you might as well expose low-level functionality and see if people like it.


What do you use to get the visual CSS rendering of the elements on the page? A library like libmoz, perhaps?


The tagging feature is very impressive.


Duh... for the pages I tried, it always gives me the "No article at this URL".

I'd like to have some kind of boilerplate removal that works well for forum content (e.g. phpBB and related), and boilerpipe (the library that I tried) gives relatively mixed results.

Does anyone know an existing solution for this?


We have a separate API for non-article pages (It's called the Follow API). It's not well documented yet, but you can get an idea for it from this demo: http://www.diffbot.com/mobilizer which will turn any webpage into a mobile version. Try putting in http://techcrunch.com or your forum thread page.


I noticed that this will happen for NYT articles behind their registration-wall.


We are using http://purifyr.com/ for this. Pretty happy with the unicode support and 20-50 documents per second speed.


Looks good, but it doesn't work for Wikipedia.

Get it working there and you'll have a lot more consumers.


I tried a couple of Wikipedia pages and it seemed to work OK. Can you email me an example?


I sent a token request almost a week ago; do you send replies for both accepted and rejected requests?

Thanks


Would this make Google's job of removing content aggregators slightly harder?


cool


Looks awesome.


Very interesting...


I agree. This guy Mike seems so smart


lol



