Hacker Hacker News - see just the programming/math/science links from HN

sqs · on July 6, 2009

I made this in a few hours for my own personal use and thought that other people might want to use it, too. It uses a Python Bayesian classifier trained on about 200 hand-labeled links and outputs a static HTML file. It's not perfect (and the technical/non-technical distinction is arbitrary), but I'll give it some more training data and see if it improves. Also, I don't mean to disparage non-technical topics with this; sometimes I just want to read about programming, though.

I'd appreciate any feedback on this.
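
For readers who want to see the general shape of the approach, here is a minimal sketch of a naive Bayes text classifier in plain Python. It is not the code behind the site (which, per a later comment in the thread, uses the Reverend library); the class name `TinyBayes`, the tokenizer, the two labels, and the sample titles are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBayes:
    """Hypothetical multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocab = set()

    def train(self, label, text):
        """Add one labeled example; can be called at any time."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def probabilities(self, text):
        """Return {label: P(label | text)} for every trained label."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long texts.
        peak = max(log_scores.values())
        exp = {k: math.exp(v - peak) for k, v in log_scores.items()}
        norm = sum(exp.values())
        return {k: v / norm for k, v in exp.items()}


# Made-up training titles, standing in for the hand-labeled links.
clf = TinyBayes()
clf.train("technical", "Erlang concurrency and message passing")
clf.train("technical", "Writing a Lisp compiler in Python")
clf.train("other", "Twitter raises new funding round")
clf.train("other", "Celebrity news roundup")

probs = clf.probabilities("A Python compiler for Erlang")
```

Because `train` only updates counts, more hand-labeled links can be fed in later without retraining from scratch.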

patio11 · on July 6, 2009

I worked in Bayesian spam filters for a while at the last day job. Wonderful things, aren't they?

I've got to say, your classifier looks like it is amazingly accurate for those distinctions. I'd be really interested in seeing some of the very determinative words. (There are some obvious candidates like "Erlang" and "Twitter" but every time I do a natural language processing project I'm amazed by how our distribution of "little meaningless words" changes markedly between FeatureA and FeatureB.)
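
One rough way to surface the most determinative words is to rank tokens by a smoothed log-odds ratio between the two classes. The sketch below assumes a simple bag-of-words model; the function names and the example titles are invented for illustration and are not taken from the site's code.

```python
import math
import re
from collections import Counter


def token_counts(texts):
    """Count alphanumeric word tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def log_odds(texts_a, texts_b):
    """Map each token to log(P(token | A) / P(token | B)) with add-one smoothing."""
    a, b = token_counts(texts_a), token_counts(texts_b)
    vocab = set(a) | set(b)
    total_a = sum(a.values()) + len(vocab)
    total_b = sum(b.values()) + len(vocab)
    return {
        tok: math.log((a[tok] + 1) / total_a) - math.log((b[tok] + 1) / total_b)
        for tok in vocab
    }


# Made-up titles for the two classes.
technical = ["Erlang for the impatient", "Erlang and its actor model"]
other = ["Twitter is down again", "Why Twitter keeps winning"]

scores = log_odds(technical, other)
ranked = sorted(scores, key=scores.get, reverse=True)
most_technical, most_other = ranked[0], ranked[-1]
```

With real data, the top and bottom of `ranked` are where the surprising "little meaningless words" would show up alongside the obvious candidates.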

sqs · on July 6, 2009

Yeah, it's amazingly accurate for now only because the training data consists of a high percentage of the actual data set, and so it's probably classifying based on those insignificant differences. We'll see how it holds up...

10ren · on July 6, 2009

I think you're training it on the submission titles - I wonder if the text of the websites themselves might be more accurate. Certainly richer. But it's quite possible that the submission titles are more accurate.

jlees · on July 6, 2009

Says above he's using body text.

paraschopra · on July 6, 2009

Why don't you provide a flag-style button for each article that lets users train the classifier online?

sqs · on July 6, 2009

That's a great idea. Right now it's just spitting out a static HTML file, so that'd require some significant changes to the code. I'll see how it's doing tomorrow and consider making it into something like what you describe.

iamelgringo · on July 6, 2009

Are you using the Reverend Thomas library? http://www.divmod.org/trac/wiki/DivmodReverend

I've been wanting to use that library for something just like this. I would love to read a write up of this, or see the code at some point.

Thanks

sqs · on July 6, 2009

Yeah, that's exactly the Bayesian classifier library that I'm using. I'll have the code up in a few hours.

edit: source at http://hackerhackernews.com/hhn-20090706.tgz

sqs · on July 6, 2009

and mercurial at http://bitbucket.org/sqs/hhn/overview/

thomaspaine · on July 6, 2009

This sounds pretty awesome. Would you mind posting some of the code?

I'd be curious to know how you're formatting the input data: are you using just the title, all of the text parsed through BeautifulSoup, some parsing algorithm to obtain the body of the article, etc.?

Not too long ago, I wrote a Python script to automatically extract the bodies of articles, which you may find useful: http://blog.davidziegler.net/post/122176962/a-python-script-...

sqs · on July 6, 2009

Thanks. I'll definitely post the code soon. I'm using the title, the submitter's name, the URL, and the page content without tags (only looking at <p> and similar tags would probably help). Including the content of the HN comments pages might help, too. Thanks for the link; I'll check it out.
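
A minimal sketch of that input formatting, assuming the four signals are simply joined into one string for a bag-of-words classifier. It uses only the standard library's `html.parser`; the names `TextExtractor`, `strip_tags`, and `build_document`, and the sample values, are hypothetical and do not come from the posted source.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_tags(html):
    """Return the page text with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_document(title, submitter, url, html):
    """Join title, submitter, URL, and tag-free page text into one string."""
    return " ".join([title, submitter, url, strip_tags(html)])


doc = build_document(
    "Example title",
    "example_user",
    "http://example.com/post",
    "<html><body><script>var x = 1;</script><p>Hello <b>world</b></p></body></html>",
)
```

Restricting `handle_data` to text inside `<p>` and similar tags would be one way to try the refinement mentioned above.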

sqs · on July 6, 2009

OK, the source is here: http://hackerhackernews.com/hhn-20090706.tgz

jrockway · on July 6, 2009

Excellent work; I am worried that I am going to have to start my own social news site to get away from all the time-wasting general interest stuff here. This will save me from having to do that for a while.

ersi · on July 6, 2009

That doesn't sound very social. Aren't most programming/math/science also "time-wasting", so to speak?

jrockway · on July 6, 2009

People want to read about programming. Right now, there are three options:

* News.YC, which rarely has articles about programming anymore (perhaps one or two a day).

* Programming Reddit, whose commentators think every article is a jab at The One True Idea, which they seem to think is PHP.

* Lambda the Ultimate, which only covers a small area of programming; language design and implementation

I want a site like LtU but with a broader appeal.

scott_s · on July 6, 2009

On the front page right now, there are eight programming submissions, and nine web development submissions.

(There is overlap with the programming and web development, but something like usability guidelines fits in web development but not programming. By my definition, anyway.)

jules · on July 6, 2009

> * Lambda the Ultimate, which only covers a small area of programming; language design and implementation

Implementation (!= Programming Language Theory) is off topic for LTU, unfortunately.

Andys · on July 7, 2009

I came to news.yc because I wanted to read about startups. Very few of the articles are about startups now.

(I am sticking around because of the general interest articles, which is what I used to visit reddit for before it became too popular.)

markbao · on July 6, 2009

Now we need Startup Hacker News.

jlees · on July 6, 2009

What about a generic My Hacker News with a classifier trained on each person's personal "I want to see / I do not want to see" preferences? :>

mahmud · on July 6, 2009

HN doesn't need subreddits.

mbrubeck · on July 6, 2009

I don't want to see the community fragmented into walled gardens like subreddits (I'd rather see everyone focus on submitting the best possible links to the front page), but it would be interesting to have a front-page algorithm (or an additional page on "lists") that tries to predict not the best links, but my best links.

diN0bot · on July 6, 2009

Why? I notice you've received (18) upvotes, which in this case probably means agreement, so there's obviously some support for not having this feature.

Is it specifically subreddits to avoid? I completely agree that subreddits are a terrible implementation of user-content customization because they force items into a single-parent hierarchy.

However, categories that filter content according to different weights would be quite useful. By pre-selecting these categories and thus pre-computing them, it would even be reasonable to implement (actual per-user filters might not work so well in real-time).

scott_s · on July 6, 2009

The fear I have is it would splinter the community.

coconutrandom · on July 6, 2009

But HN has become so big. It's not logical to think that everyone will be interested in everything now.

scott_s · on July 6, 2009

I'm not convinced the community has become so big that subdivisions are necessary.

magv · on July 6, 2009

> Page generated automatically at 23:49 on 05 July 2009

Can you specify the timezone?

Also, how often is the page updated?

sqs · on July 6, 2009

I'll add the timezone (PDT). It's updated whenever I run a command to update the HTML and scp that to the server. Nothing automated about the process yet.
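
A small sketch of how the generated-at line could carry the timezone, using only the standard library. The fixed UTC-7 offset labeled "PDT" is an assumption for illustration (it ignores the switch back to standard time), and `generated_stamp` is a hypothetical helper, not part of the posted source.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset named "PDT"; no daylight-saving rules.
PDT = timezone(timedelta(hours=-7), "PDT")


def generated_stamp(now):
    """Format an aware datetime as the page's 'generated at' line in PDT."""
    local = now.astimezone(PDT)
    return local.strftime("Page generated automatically at %H:%M on %d %B %Y %Z")


stamp = generated_stamp(datetime(2009, 7, 6, 6, 49, tzinfo=timezone.utc))
```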

dgallagher · on July 6, 2009

I think you've stumbled on something that Hacker News needs: Categories.

astine · on July 6, 2009

I think that we'd be better off with tagging than categories per se. Give people a finite list of tags to choose from and people could customize their views pretty easily.

zimbabwe · on July 6, 2009

Newsvine is a good example of tags making a community more focused and easier to handle.

vaksel · on July 6, 2009

as long as the same story can't be submitted by 2 different people by submitting it into different categories

prakash · on July 6, 2009

This is really cool. Maybe you can use these links to train: (http://www.gabrielweinberg.com/startupswiki/Ask_YC_Archive)

sosuke · on July 6, 2009

I'd love to see a threshold setting. Something like "show only above 50%" so that your visitors can adjust the visible links even further. Otherwise I love the site :)
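
A threshold like that is a small filter over the classifier's scores. The sketch below assumes each link is a dict with a `p_technical` probability; that field name, the function name, and the sample data are made up for illustration.

```python
def visible_links(scored_links, threshold=0.5):
    """Keep links whose technical probability is at least `threshold`, best first."""
    kept = [link for link in scored_links if link["p_technical"] >= threshold]
    return sorted(kept, key=lambda link: link["p_technical"], reverse=True)


# Hypothetical scored links.
links = [
    {"title": "Example A", "p_technical": 0.91},
    {"title": "Example B", "p_technical": 0.42},
    {"title": "Example C", "p_technical": 0.67},
]
shown = visible_links(links, threshold=0.5)
```

On a static page the same cutoff could also be applied client-side, but that would mean writing the scores into the HTML.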

prakash · on July 6, 2009

Have you considered applying to the YC program with this idea, or a variant of it?

mattmaroon · on July 6, 2009

Could you make a page that is the exact opposite? Non Hacker Hacker News?

alexgartrell · on July 6, 2009

The reason hacker news sucks less than other places is cause when people come up with cool mashups that could be interpreted negatively (you don't like all the news chosen by the community, whaaatttt?!), no one gets pissy or flamey.

It's almost like "the way society should work" or something...

sqs · on July 6, 2009

Here's the source: http://hackerhackernews.com/hhn-20090706.tgz

The classifier is already struggling. It doesn't seem that this is a sustainable way of classifying links, especially since the classification of technical/non-technical is arbitrary itself. I'll be trying out some other things to improve it today; email me if you want to chat about it.

bts · on July 6, 2009

FYI I'm having problems decompressing this file. Could you re-post it? or alternatively throw the code up on something like github?

sqs · on July 6, 2009

Hmm...tar xzfv hhn-20090706.tgz doesn't work for you?

sqs · on July 6, 2009

BTW, the code's up at http://bitbucket.org/sqs/hhn/overview/ now

mkyc · on July 6, 2009

A big problem with these tools is that very few will abandon this place for a slightly ugly replacement, even if it does offer something somewhat useful. If you provide an API, someone might write a script/bookmark that added the percentages to the posts here, or color coded them.

Are you training on comments as well? My bet is that the comments section will be more useful for this than the actual article.

Incidentally, I really wish that upvotes and the like were publicly visible. This would surely result in a similar tool, but for a recommendation system.

kf · on July 6, 2009

Might as well throw up Non Hacker News while you're at it.

ckunte · on July 6, 2009

I'd love an RSS/feed link for this.

catone · on July 6, 2009

I'd like two. ;)

gabrielroth · on July 6, 2009

One possible next step would be a userscript (for Greasemonkey, Chrome, Fluid, etc.) that removes any story not on Hacker Hacker News from Hacker News itself. That way HHN fans could continue to use all the features of the main site.

Oompa · on July 6, 2009

Make one that just filters out celebrity news and other "Not HN" articles.

I_got_fifty · on July 6, 2009

I love that the announcement of Hacker Hacker News on Hacker News is on Hacker Hacker News!

...so meta!

swombat · on July 6, 2009

Surely an announcement about Hacker Hacker News is not Hacker Hacker News itself, so it should not be on there.

kristiandupont · on July 6, 2009

Really, it should be posted on Hacker News News.. (Sorry, I'll go out of Reddit mode now.)

kf · on July 6, 2009

http://ycombinator.com/newsnews.html

TweedHeads · on July 6, 2009

This is just an indication that HN should go back to its roots.

Frocer · on July 6, 2009

Agreed, but I am not sure why. A couple of years ago, I would read almost every article / discussion posted on the front page of HN. But nowadays, I click through at most 10-20% of the content. Dunno if it's a shift in culture, or just that the larger community has diluted the quality of content. Regardless, HN is still my favorite source of tech / startup news.

alex_c · on July 6, 2009

Startup news?

zackattack · on July 6, 2009

Would somebody please skin News.YC so that we can click on the "comments" link without having to squint and be super precise? I would like a big square to the right of the up/down arrows, and to the left of the Title/Score (spanning both, vertically).