I made this in a few hours for my own personal use and thought that other people might want to use it, too. It uses a Python Bayesian classifier trained on about 200 hand-labeled links and outputs a static HTML file. It's not perfect (and the technical/non-technical distinction is arbitrary), but I'll give it some more training data and see if it improves. Also, I don't mean to disparage non-technical topics with this; sometimes I just want to read about programming, though.
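For readers wondering what a classifier like this might look like, here is a minimal sketch of a naive Bayes text classifier with add-one smoothing. It is not the code behind the site, which has not been posted; the class name `NaiveBayes`, the `tokenize` helper, and the label strings are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Hypothetical multinomial naive Bayes classifier (illustration only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training examples
        self.vocab = set()           # every token seen during training

    def train(self, text, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            self.vocab.add(tok)

    def prob(self, text, label):
        """Return the posterior P(label | text) using add-one smoothing."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for lab, counts in self.word_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[lab] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
            log_scores[lab] = score
        # Normalize in log space to avoid underflow on long texts.
        top = max(log_scores.values())
        weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        return weights.get(label, 0.0) / sum(weights.values())


nb = NaiveBayes()
nb.train("Erlang concurrency primitives explained", "technical")
nb.train("Writing a compiler in Python", "technical")
nb.train("Twitter raises new funding round", "non-technical")
nb.train("Why startups fail at marketing", "non-technical")
```

With a training set this small the probabilities are only indicative, but `nb.prob("Erlang compiler internals", "technical")` comes out well above one half.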
I worked on Bayesian spam filters for a while at my last day job. Wonderful things, aren't they?
I've got to say, your classifier looks like it is amazingly accurate for those distinctions. I'd be really interested in seeing some of the very determinative words. (There are some obvious candidates like "Erlang" and "Twitter" but every time I do a natural language processing project I'm amazed by how our distribution of "little meaningless words" changes markedly between FeatureA and FeatureB.)
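One way to look at the most determinative words is to rank tokens by a smoothed log-odds ratio between the two classes. The sketch below assumes two lists of labeled strings; the function name `informative_words` and the sample titles are made up for illustration and are not taken from the site's data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def informative_words(docs_a, docs_b, top=5):
    """Rank tokens by smoothed log-odds of appearing in docs_a versus docs_b.

    Returns (tokens most indicative of A, tokens most indicative of B),
    where the second list ends with the strongest B indicator.
    """
    counts_a = Counter(t for d in docs_a for t in tokenize(d))
    counts_b = Counter(t for d in docs_b for t in tokenize(d))
    vocab = set(counts_a) | set(counts_b)
    # Add-one smoothing so unseen tokens do not produce log(0).
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    scores = {
        w: math.log((counts_a[w] + 1) / total_a)
        - math.log((counts_b[w] + 1) / total_b)
        for w in vocab
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top], ranked[-top:]


# Hypothetical sample titles, not real training data.
technical = ["Erlang Erlang processes", "Erlang in the browser"]
general = ["Twitter Twitter outage", "the Twitter in the news"]
most_technical, most_general = informative_words(technical, general, top=1)
```

Common filler words such as "the" also get a nonzero score here, which is the effect described above: their relative frequency shifts between classes even though they carry no topical meaning.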
Yeah, it's amazingly accurate for now only because the training data consists of a high percentage of the actual data set, and so it's probably classifying based on those insignificant differences. We'll see how it holds up...
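A rough way to check whether the accuracy holds up on unseen links is to hold back part of the labeled set and score only on that part. This is a sketch under the assumption that examples are `(text, label)` pairs; `holdout_accuracy`, `fit`, and `predict` are hypothetical names, and any classifier can be plugged in.

```python
import random


def holdout_accuracy(examples, fit, predict, test_fraction=0.25, seed=0):
    """Train on part of the labeled data and measure accuracy on the rest.

    examples: list of (text, label) pairs
    fit:      callable taking the training pairs and returning a model
    predict:  callable taking (model, text) and returning a label
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    # Keep at least one example in the test split.
    cut = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:cut], shuffled[cut:]
    model = fit(train)
    correct = sum(predict(model, text) == label for text, label in test)
    return correct / len(test)
```

If the held-out accuracy is much lower than the accuracy on the training links, the classifier is probably keying on incidental differences in the training set.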
I think you're training it on the submission titles - I wonder if the text of the websites themselves might be more accurate. Certainly richer. But it's quite possible that the submission titles are more accurate.
That's a great idea. Right now it's just spitting out a static HTML file, so that'd require some significant changes to the code. I'll see how it's doing tomorrow and consider making it into something like what you describe.
This sounds pretty awesome. Would you mind posting some of the code?
I'd be curious to know how you're formatting the input data. Are you just using the title, all of the text parsed through BeautifulSoup, some parsing algorithm to obtain the body of the article, etc.?
Thanks. I'll definitely post the code soon. I'm using the title, the submitter's name, the URL, and the page content without tags (only looking at <p> and similar tags would probably help). Including the content of the HN comments pages might help, too. Thanks for the link; I'll check it out.
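The feature set described above (title, submitter, URL, and page text restricted to `<p>` tags) could be assembled roughly as follows. This is a sketch using only the standard library's `html.parser` rather than the actual code; `ParagraphText`, `build_features`, and the `user:` and `host:` prefixes are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ParagraphText(HTMLParser):
    """Collect text that appears inside <p> elements only."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <p> elements are currently open
        self.chunks = []  # stripped text fragments found inside <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def build_features(title, submitter, url, page_html):
    """Join title, submitter, host name, and paragraph text into one string."""
    parser = ParagraphText()
    parser.feed(page_html)
    parser.close()
    host = urlparse(url).netloc
    return " ".join([title, f"user:{submitter}", f"host:{host}", *parser.chunks])
```

Pages that leave `<p>` tags unclosed will cause this simple depth counter to keep collecting text after the paragraph ends, so real-world HTML may need a more forgiving parser.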
Excellent work; I am worried that I am going to have to start my own social news site to get away from all the time-wasting general interest stuff here. This will save me from having to do that for a while.
On the front page right now, there are eight programming submissions, and nine web development submissions.
(There is overlap with the programming and web development, but something like usability guidelines fits in web development but not programming. By my definition, anyway.)
I don't want to see the community fragmented into walled gardens like subreddits (I'd rather see everyone focus on submitting the best possible links to the front page), but it would be interesting to have a front-page algorithm (or an additional page on "lists") that tries to predict not the best links, but my best links.
Why? I notice you've received (18) upvotes, which in this case probably means agreement, so there's obviously some support for not having this feature.
Is it specifically subreddits to avoid? I completely agree that subreddits are a terrible implementation of user-content customization because they force items into a single-parent hierarchy.
However, categories that filter content according to different weights would be quite useful. By pre-selecting these categories and thus pre-computing them, it would even be reasonable to implement (actual per-user filters might not work so well in real-time).
I'll add the timezone (PDT). It's updated whenever I run a command to update the HTML and scp that to the server. Nothing automated about the process yet.
I think that we'd be better off with tagging than categories per se. Give people a finite list of tags to choose from and people could customize their views pretty easily.
I'd love to see a threshold setting. Something like "show only above 50%" so that your visitors can adjust the visible links even further. Otherwise I love the site :)
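Since the site is a static HTML file, one simple way to support a threshold would be to filter at render time. The sketch below assumes each link is a `(title, url, probability)` tuple; `render_links` is a hypothetical name, and a truly visitor-adjustable threshold would need client-side scripting or several pre-rendered pages.

```python
from html import escape


def render_links(scored_links, threshold=0.5):
    """Return an HTML list of links scoring strictly above the threshold.

    scored_links: iterable of (title, url, probability) tuples
    """
    ordered = sorted(scored_links, key=lambda row: row[2], reverse=True)
    rows = [
        f'<li><a href="{escape(url, quote=True)}">{escape(title)}</a> ({prob:.0%})</li>'
        for title, url, prob in ordered
        if prob > threshold
    ]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"


# Hypothetical scored links for illustration.
page = render_links(
    [("Erlang internals", "http://example.com/a", 0.9),
     ("Celebrity news", "http://example.com/b", 0.2)],
    threshold=0.5,
)
```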
<Pardon the apparent attempt at circle jerking that's about to happen>
The reason hacker news sucks less than other places is cause when people come up with cool mashups that could be interpreted negatively (you don't like all the news chosen by the community, whaaatttt?!), no one gets pissy or flamey.
It's almost like "the way society should work" or something...
The classifier is already struggling. It doesn't seem that this is a sustainable way of classifying links, especially since the classification of technical/non-technical is arbitrary itself. I'll be trying out some other things to improve it today; email me if you want to chat about it.
A big problem with these tools is that very few will abandon this place for a slightly ugly replacement, even if it does offer something somewhat useful. If you provide an API, someone might write a script/bookmark that added the percentages to the posts here, or color coded them.
Are you training on comments as well? My bet is that the comments section will be more useful for this than the actual article.
Incidentally, I really wish that upvotes and the like were publicly visible. This would surely result in a similar tool, but for a recommendation system.
One possible next step would be a userscript (for Greasemonkey, Chrome, Fluid, etc.) that removes any story not on Hacker Hacker News from Hacker News itself. That way HHN fans could continue to use all the features of the main site.
Agreed, but I am not sure why. A couple of years ago, I would read almost every article / discussion posted on the front page of HN. But nowadays, I click through at most 10-20% of the content. Dunno if it's a shift in culture, or just the larger community has diluted the quality of content. Regardless, HN is still my favorite source of tech / startup news.
Would somebody please skin News.YC so that we can click on the "comments" link without having to squint and be super precise? I would like a big square to the right of the up/down arrows, and to the left of the Title/Score (spanning both, vertically).
I'd appreciate any feedback on this.