Hacker Hacker News - see just the programming/math/science links from HN (hackerhackernews.com)
164 points by sqs on July 6, 2009 | 56 comments



I made this in a few hours for my own personal use and thought that other people might want to use it, too. It uses a Python Bayesian classifier trained on about 200 hand-labeled links and outputs a static HTML file. It's not perfect (and the technical/non-technical distinction is arbitrary), but I'll give it some more training data and see if it improves. Also, I don't mean to disparage non-technical topics with this; sometimes I just want to read about programming, though.

I'd appreciate any feedback on this.
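
For readers unfamiliar with the approach, here is a minimal sketch of a naive Bayes text classifier using only the Python standard library. It is not the code behind the site (which, per the thread, uses the Reverend library); the class name, tokenizer, and example titles are all made up for illustration, and the add-one smoothing is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.token_counts = {}       # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, label, text):
        tokens = tokenize(text)
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def probabilities(self, text):
        """Return {label: probability} for the given text."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total_tokens + len(self.vocab)))
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long documents.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


# Hypothetical hand-labeled examples (the real training set had about 200 links).
clf = NaiveBayes()
for title in ["Writing a Lisp interpreter in Python",
              "Erlang concurrency explained",
              "Proof of a prime number theorem"]:
    clf.train("technical", title)
for title in ["Twitter raises new funding round",
              "Why startups fail at marketing",
              "Celebrity joins Twitter"]:
    clf.train("other", title)

probs = clf.probabilities("A Python interpreter for Erlang")
print(probs)
```

With so few examples the probabilities are not meaningful in themselves; the point is only the shape of the train/score loop.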


I worked on Bayesian spam filters for a while at my last day job. Wonderful things, aren't they?

I've got to say, your classifier looks like it is amazingly accurate for those distinctions. I'd be really interested in seeing some of the very determinative words. (There are some obvious candidates like "Erlang" and "Twitter" but every time I do a natural language processing project I'm amazed by how our distribution of "little meaningless words" changes markedly between FeatureA and FeatureB.)
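
One way to look for the most determinative words is to rank tokens by a smoothed log-odds ratio between the two classes. The sketch below is an illustration under assumptions, not output from the actual classifier: the function name `log_odds` and the sample titles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def log_odds(docs_a, docs_b):
    """Map each token to log P(token | A) - log P(token | B), add-one smoothed."""
    counts_a = Counter(t for d in docs_a for t in tokenize(d))
    counts_b = Counter(t for d in docs_b for t in tokenize(d))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    return {
        tok: math.log((counts_a[tok] + 1) / total_a)
        - math.log((counts_b[tok] + 1) / total_b)
        for tok in vocab
    }


# Hypothetical labeled titles.
technical = ["Erlang for the impatient",
             "Why Erlang processes are cheap",
             "A proof of the theorem"]
other = ["Twitter is down again",
         "Why Twitter matters for the media",
         "The celebrity economy"]

scores = log_odds(technical, other)
ranked = sorted(scores, key=scores.get, reverse=True)
print("most technical:", ranked[:3])
print("least technical:", ranked[-3:])
```

Positive scores lean toward the first class and negative scores toward the second; function words such as "the" land near zero unless their usage really does differ between the classes.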


Yeah, it's amazingly accurate for now only because the training data consists of a high percentage of the actual data set, and so it's probably classifying based on those insignificant differences. We'll see how it holds up...


I think you're training it on the submission titles - I wonder if the text of the websites themselves might be more accurate. Certainly richer. But it's quite possible that the submission titles are more accurate.


Says above he's using body text.


Why don't you provide a flag button for each article that lets users train the classifier online?


That's a great idea. Right now it's just spitting out a static HTML file, so that'd require some significant changes to the code. I'll see how it's doing tomorrow and consider making it into something like what you describe.


Are you using the reverend thomas library? http://www.divmod.org/trac/wiki/DivmodReverend

I've been wanting to use that library for something just like this. I would love to read a write up of this, or see the code at some point.

Thanks


Yeah, that's exactly the Bayesian classifier library that I'm using. I'll have the code up in a few hours.

edit: source at http://hackerhackernews.com/hhn-20090706.tgz



This sounds pretty awesome. Would you mind posting some of the code?

I'd be curious to know how you're formatting the input data. Are you just using the title, all of the text parsed through BeautifulSoup, some parsing algorithm to obtain the body of the article, etc.?

Not too long ago, I wrote a Python script to automatically extract the bodies of articles, which you may find useful: http://blog.davidziegler.net/post/122176962/a-python-script-...


Thanks. I'll definitely post the code soon. I'm using the title, the submitter's name, the URL, and the page content without tags (only looking at <p> and similar tags would probably help). Including the content of the HN comments pages might help, too. Thanks for the link; I'll check it out.
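
A rough sketch of how those four inputs could be combined into one training document, using only the standard library's `html.parser`. The helper names (`strip_tags`, `build_document`), the `submitter:` prefix, and the sample page are assumptions for illustration; the actual code may do this differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_tags(html):
    """Return the page text with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_document(title, submitter, url, page_html):
    """Join title, submitter, URL, and tag-free page text into one string."""
    # The "submitter:" prefix keeps a username from colliding with ordinary words.
    return " ".join([title, "submitter:" + submitter, url, strip_tags(page_html)])


# Hypothetical page content.
page = ("<html><head><style>p{color:red}</style></head>"
        "<body><p>Erlang is <b>fun</b>.</p><script>var x = 1;</script></body></html>")
doc = build_document("Hacker Hacker News", "sqs", "http://hackerhackernews.com", page)
print(doc)
```

Restricting `handle_data` to text inside `<p>` and similar tags, as suggested in the comment, would be a small change to the same parser.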



Excellent work; I am worried that I am going to have to start my own social news site to get away from all the time-wasting general interest stuff here. This will save me from having to do that for a while.


That doesn't sound very social. Aren't most programming/math/science also "time-wasting", so to speak?


People want to read about programming. Right now, there are three options:

* News.YC, which rarely has articles about programming anymore (perhaps one or two a day).

* Programming Reddit, whose commentators think every article is a jab at The One True Idea, which they seem to think is PHP.

* Lambda the Ultimate, which only covers a small area of programming: language design and implementation.

I want a site like LtU but with a broader appeal.


On the front page right now, there are eight programming submissions, and nine web development submissions.

(There is overlap with the programming and web development, but something like usability guidelines fits in web development but not programming. By my definition, anyway.)


> * Lambda the Ultimate, which only covers a small area of programming: language design and implementation.

Implementation (!= Programming Language Theory) is off topic for LTU, unfortunately.


I came to news.yc because I wanted to read about startups. Very few of the articles are about startups now.

(I am sticking around because of the general-interest articles, which are what I used to visit reddit for before it became too popular.)


Now we need Startup Hacker News.


What about a generic My Hacker News with a classifier trained on each person's personal "I want to see / I do not want to see" preferences? :>


HN doesn't need subreddits.


I don't want to see the community fragmented into walled gardens like subreddits (I'd rather see everyone focus on submitting the best possible links to the front page), but it would be interesting to have a front-page algorithm (or an additional page on "lists") that tries to predict not the best links, but my best links.


Why? I notice you've received (18) upvotes, which in this case probably means agreement, so there's obviously some support for not having this feature.

Is it specifically subreddits to avoid? I completely agree that subreddits are a terrible implementation of user-content customization because they force items into a single-parent hierarchy.

However, categories that filter content according to different weights would be quite useful. By pre-selecting these categories and thus pre-computing them, it would even be reasonable to implement (actual per-user filters might not work so well in real-time).


The fear I have is it would splinter the community.


But HN has become so big. It's not logical to think that everyone will be interested in everything now.


I'm not convinced the community has become so big that subdivisions are necessary.


> Page generated automatically at 23:49 on 05 July 2009

Can you specify the timezone?

Also, how often is the page updated?


I'll add the timezone (PDT). It's updated whenever I run a command to update the HTML and scp that to the server. Nothing automated about the process yet.
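
A small sketch of stamping the generated page with an explicit timezone. The function name `generated_line` and the fixed UTC-7 offset for PDT are assumptions; a fixed offset ignores the switch back to standard time, so a real script would probably use a named zone such as America/Los_Angeles instead.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offset for Pacific Daylight Time (UTC-7).
PDT = timezone(timedelta(hours=-7), "PDT")


def generated_line(now):
    """Format the footer line with the timezone abbreviation included."""
    local = now.astimezone(PDT)
    return local.strftime("Page generated automatically at %H:%M %Z on %d %B %Y")


stamp = generated_line(datetime(2009, 7, 6, 6, 49, tzinfo=timezone.utc))
print(stamp)  # Page generated automatically at 23:49 PDT on 05 July 2009
```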


I think you've stumbled on something that Hacker News needs: Categories.


I think that we'd be better off with tagging than categories per se. Give people a finite list of tags to choose from and people could customize their views pretty easily.


Newsvine is a good example of tags making a community more focused and easier to handle.


As long as the same story can't be submitted by two different people by submitting it into different categories.


This is really cool. Maybe you can use these links to train: (http://www.gabrielweinberg.com/startupswiki/Ask_YC_Archive)


I'd love to see a threshold setting. Something like "show only above 50%" so that your visitors can adjust the visible links even further. Otherwise I love the site :)
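
A threshold setting like this could be a one-line filter over the classifier's scores. A minimal sketch, assuming each link arrives as a (title, probability) pair; the function name and the sample scores are invented for illustration.

```python
def filter_links(scored_links, threshold=0.5):
    """Keep links scoring above the threshold, highest score first."""
    kept = [(title, p) for title, p in scored_links if p > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical classifier output: (title, probability of being technical).
scored = [
    ("Erlang concurrency explained", 0.93),
    ("Celebrity joins Twitter", 0.08),
    ("Usability guidelines for forms", 0.55),
    ("Why startups fail at marketing", 0.31),
]

visible = filter_links(scored, threshold=0.5)
for title, p in visible:
    print(f"{p:.0%}  {title}")
```

Because the site is a static HTML file, one option would be to write each link's score into the markup and apply the threshold client-side, so the page does not need to be regenerated per visitor.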


Have you considered applying to the YC program with this idea or a variant of it?


Could you make a page that is the exact opposite? Non Hacker Hacker News?


<Pardon the apparent attempt at circle jerking that's about to happen>

The reason Hacker News sucks less than other places is because when people come up with cool mashups that could be interpreted negatively (you don't like all the news chosen by the community, whaaatttt?!), no one gets pissy or flamey.

It's almost like "the way society should work" or something...


Here's the source: http://hackerhackernews.com/hhn-20090706.tgz

The classifier is already struggling. It doesn't seem that this is a sustainable way of classifying links, especially since the classification of technical/non-technical is arbitrary itself. I'll be trying out some other things to improve it today; email me if you want to chat about it.


FYI, I'm having problems decompressing this file. Could you re-post it, or alternatively throw the code up on something like GitHub?


Hmm...tar xzfv hhn-20090706.tgz doesn't work for you?


BTW, the code's up at http://bitbucket.org/sqs/hhn/overview/ now


A big problem with these tools is that very few will abandon this place for a slightly ugly replacement, even if it does offer something somewhat useful. If you provide an API, someone might write a script/bookmark that added the percentages to the posts here, or color coded them.

Are you training on comments as well? My bet is that the comments section will be more useful for this than the actual article.

Incidentally, I really wish that upvotes and the like were publicly visible. This would surely result in a similar tool, but for a recommendation system.


Might as well throw up Non Hacker News while you're at it.


I'd love an RSS feed link for this.


I'd like two. ;)


One possible next step would be a userscript (for Greasemonkey, Chrome, Fluid, etc.) that removes any story not on Hacker Hacker News from Hacker News itself. That way HHN fans could continue to use all the features of the main site.


Make one that just filters out celebrity news and other "Not HN" articles.


I love that the announcement of Hacker Hacker News on Hacker News is on Hacker Hacker News!

...so meta!


Surely an announcement about Hacker Hacker News is not Hacker Hacker News itself, so it should not be on there.


Really, it should be posted on Hacker News News.. (Sorry, I'll go out of Reddit mode now.)



This is just an indication that HN should go back to its roots.


Agreed, but I am not sure why. A couple of years ago, I would read almost every article / discussion posted on the front page of HN. But nowadays, I click through at most 10-20% of the content. Dunno if it's a shift in culture, or just that the larger community has diluted the quality of content. Regardless, HN is still my favorite source of tech / startup news.


Startup news?


Would somebody please skin News.YC so that we can click on the "comments" link without having to squint and be super precise? I would like a big square to the right of the up/down arrows, and to the left of the Title/Score (spanning both, vertically).



