2. Oh, excellent! We hadn't found that or we'd have used it, and we'll start working with it.
3. Tomorrow I'm going to blog about how we approached the machine learning. Short version: we manually came up with regular expressions to classify a training set based on titles. When we experimented with manual annotations on titles, the vast majority of the time we were looking for only a few key words. There's no question that this adds biases and will not be entirely accurate, but manual inspection convinced us it was a good enough approach for our hackathon, and most of the articles we identified with the resulting algorithm would not have been found by the title regex alone.
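As a rough illustration of the regex labeling step described above, here is a minimal sketch. The tag names and patterns are hypothetical (the thread does not list the real ones), and it only covers producing labels from titles, not training the classifier on the article text.

```python
import re

# Hypothetical tag patterns for illustration only; the real ones are not given in the thread.
TAG_PATTERNS = {
    "python": re.compile(r"\bpython\b|\bdjango\b", re.IGNORECASE),
    "security": re.compile(r"\b(vulnerabilit(y|ies)|exploit|breach)\b", re.IGNORECASE),
    "machine-learning": re.compile(
        r"\b(machine learning|neural net(work)?s?|deep learning)\b", re.IGNORECASE
    ),
}


def label_title(title):
    """Return the sorted list of tags whose pattern matches the title."""
    return sorted(
        tag for tag, pattern in TAG_PATTERNS.items() if pattern.search(title)
    )


titles = [
    "Show HN: A Django plugin for rate limiting",
    "Deep learning for protein folding",
    "New exploit found in popular router firmware",
    "Ask HN: What are you reading?",
]

# Titles that match no pattern get an empty list and would be left out of
# the training set (or treated as negatives, depending on the setup).
labels = {title: label_title(title) for title in titles}
```

The labels produced this way are noisy, which is the bias mentioned above; the point of training on article text afterwards is to find stories whose titles the patterns miss.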
Oh, that was silly of us not to use BigQuery! I was just able to use that to download a full million stories (though we still would have had the rate-limiting step of downloading the articles).
During a hackathon it can be hard to tell when to keep searching for an easy solution like that, as opposed to going with something slow you know will work; sometimes the search for an easy solution turns out to be a dead end.
I think the biggest value proposition of this is the ability to do subreddit-like filtering on specific tags. As Hacker News grows I think dealing with the number of new submissions would become a bottleneck. During high-traffic times, new submissions sometimes drop off the first 'new' page in around 10 minutes. Of course there is more traffic during these times to upvote good content, but I'm not sure that is better than letting a smaller number of people have a longer period of time to filter a smaller collection of content.
Being able to filter out and automatically hide stories from the front page by tag would be lovely. There's just so much stuff I don't care about at all that gets pushed up the front page and takes up residence there, and I'm starting to wear out the "hide" link...
Strength of Hacker News is the network effect of a diverse, intelligent crowd. It would be hard to replace. A supplemental site tagging it and aiding search has value. Biggest problem I have searching HN, though, is Google mixing up stories and comments. The fix might be as simple as two domains that contain stories and comments cloned from HN, one domain for each, followed by Google Searches within those domains. Not sure if Google would automatically crawl it, though.
Also, you can easily search HN in Chrome: start typing news.ycombinator.com, it will be ready for autocomplete, and it will say in grey text "Press [tab] to search HN". The results show up from algolia.
If only there were a way to make the default search order recency instead of popularity — most of my searching is before posting something, to make sure it hasn't already been posted.
Weirdly, the default values for the search url are slightly different from what the Chrome search populates them with. The 'address bar search' defaults 'dateRange' to 'all', whilst the script itself defaults 'dateRange' to 'last24h'. Does anyone know how the Chrome address bar search is implemented?
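For what it's worth, the search page seems to take its settings from the query string, so recency ordering and the date range can be pinned in a bookmark or custom search engine entry. A minimal sketch: 'dateRange' is the parameter named above, while 'query', 'sort=byDate' and 'type' are assumptions based on what the site puts in its own URLs, so check them against a real search before relying on this.

```python
from urllib.parse import urlencode


def hn_search_url(query, date_range="all", sort="byDate", item_type="story"):
    """Build a search-page URL with explicit values instead of the defaults."""
    params = {
        "query": query,
        "dateRange": date_range,  # e.g. "all" or "last24h", as mentioned above
        "sort": sort,             # assumed: "byDate" for recency, "byPopularity" otherwise
        "type": item_type,        # assumed: "story" limits results to submissions
    }
    return "https://hn.algolia.com/?" + urlencode(params)


url = hn_search_url("random forest tagging")
```

This does not answer how Chrome builds its own address bar search; it only shows one way to get the ordering you want.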
I find the domain-to-user facet lookup more useful than the tagging option. I am sure you could deduce tags from that alone on 90% of the submissions. Good demo.
As a sincere question, why did people downvote this comment?
I thought he was funny and I think lighthearted humor has value to it. It didn't seem snarky to me, did it to someone else?
Did it seem off-topic? It's a joke rather than useful information, but I'd argue that it is on-topic per the rules: "Anything that good hackers would find interesting."
Then nobody would need to visit the site at all, it'd be completely self-ranking and automated - imagine the productivity improvements in the valley? (Disrupting the web forum industry!)
(Stupid ideas? I'm full of em! Execution on stupid ideas until they get enough VC capital to become obviously good ideas? I'm pretty lame at that... Anyone wanna be my "Executing Co-founder"?)
My impression is that the bar for humor is higher on Hacker News than most other places. Perhaps at a minimum the joke should be intellectually interesting enough that there is no need to lawyer up behind a claim that the joke is on topic. Here the nexus is formulaic:
?X is, in fact, Satoshi Nakamoto!
Chuck Norris, Paul Graham, my dog Spot for ?X are each about equally humorous. I think this is because each is about as clever an intellectual move. That's not to say that a joke connecting Stallman to Nakamoto couldn't work on Hacker News. Just that it would probably require a lot more work: e.g. a better premise than "a hackathoned machine learning classifier might be the singularity". Even here it might have worked if the author had gone all in and backed up the claim with examples, anecdotes, and rationales that pushed the joke-telling art via absurdity. https://www.youtube.com/watch?v=itWxXyCfW5s
Note: While I'm replying to you, please note that I'm not claiming you've done anything like this or that you are doing these things. Rather, it's just your comment sparked these thoughts. That is all. =)
Since this topic is already fairly meta, and because of the nature of your question, I'll chime in here as well as to why I would normally down vote your comment in other threads.
"why did people downvote this comment?"
Any discussion of voting (outside of a few exceptions such as this) gets down voted quickly. Not only is it discouraged in the guidelines, it's also generally self-correcting. I've seen far too many comments that ask why they are down voted when they clearly have more votes up than down. In addition, the conversations in reply generally revolved around why people might be voting down, and whether that is wrong.
Basically, it creates a bunch of useless commentary for no good reason.
In addition to this, asking people why they voted down a comment is annoying. The goal of commentary should be to spark either conversation or thought. If it does neither, it's really not worth my trouble to explain why I down vote it. I vote down the comment because it is a bad comment, and not worthy of worthwhile discussion.
I've voted comments up that I disagree with because the discussions they've sparked were interesting and voted down comments I agree with because they don't honestly contribute to the active discussion and exchange of ideas.
Not everyone thinks this way. I'm sure people vote up what they agree with and vote down what they disagree with, without a care for the overall discussion, simply to fit an agenda. I admit I've done it in the past (I am not perfect, after all), and I've regretted it. But overall voting corrects itself, and frankly, it does not matter. Karma is representative of your value.
If you are that concerned about the karma of a comment, do not post. If people are voting down your comment and not replying, start by addressing the failings in your comment to spark proper discussion.
Blaming others (tripe like "people would rather vote down than explain where I am wrong") is weak and childish, and will get voted down without hesitation. HN should be better than that weak (non-existent?) rhetoric, and the moment you add that to a comment, you've lost.
I did a similar project a few months ago. It does automatic tagging + summarization of HN, largely using scipy and numpy. You can see it in action: http://hntop.org
Here is the GitHub link: https://github.com/bexp/textai
Not to hijack, but this is similar to a small ML project a friend and I built. It takes news headlines from a bunch of sources and classifies them by common topic. We took a lot longer than a day to build it, though. ;)
I would agree. For me the bright blue tag with the sharp box really makes it hard to scan the headlines without spending a ton of concentration on ignoring the tags.
Here is my take on it from a few years ago. I tried to make it more like a magazine, pulling in the article text and a photo, and there is a section with only articles that reached the top position.
http://www.hnzine.com
Wow, I was thinking of making something similar: an experimental HN fork where submissions are tagged (collaboratively) but without titles, as these are rarely useful to predict the content of an article. And of course there is also the convenience of categorization.
If you keep adding "magic" and doing careful research on what magic works and what doesn't, you end up roughly with the modern field of machine learning.
Random forests are a method that's often effective in taking into account many interactions among high dimensional data.
I, for one, do like the idea of tagging stuff, since I might favorite a lot of stories but then years later it's hard to find a particular one, even if you do remember the general topic.
Tags for this would be really helpful for me, ergo it's not perfect for me, ergo it's good someone else is trying to make it better.
Since it's not affecting the original site, why would you want to stop them?
A few comments:
1. To other commenters, as with the HN Vue demo a week ago (https://news.ycombinator.com/item?id=14284877), the project is a technical proof-of-concept; the aesthetics aren't the primary focus.
2. The Algolia API is better for scraping because it allows for bulk requests, unlike the official API (my old 2014 script still works I think: https://github.com/minimaxir/get-all-hacker-news-submissions...)
3. How much time did it take to manually label the training/test set before training the RF classifier? Even with topic modeling for extrapolating tags, accurate labeling for 20,000 submissions is a task.
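To make point 2 concrete, here is a minimal sketch of what bulk requests against the Algolia API could look like. It only builds the request URLs and does not touch the network. The endpoint and the parameter names ('tags', 'numericFilters', 'hitsPerPage', 'page') are written from memory of the public HN Search API and should be treated as assumptions to check against its documentation; this is not the linked script.

```python
from urllib.parse import urlencode

# Assumed endpoint of the HN Search API, which returns results newest first.
API_BASE = "https://hn.algolia.com/api/v1/search_by_date"


def page_urls(start_ts, end_ts, pages, hits_per_page=1000):
    """Build one URL per result page for stories created in [start_ts, end_ts)."""
    urls = []
    for page in range(pages):
        params = {
            "tags": "story",
            # Unix timestamps bound the window, so each query stays small.
            "numericFilters": f"created_at_i>={start_ts},created_at_i<{end_ts}",
            "hitsPerPage": hits_per_page,
            "page": page,
        }
        urls.append(f"{API_BASE}?{urlencode(params)}")
    return urls


# Example: three pages covering one day of submissions.
urls = page_urls(1494201600, 1494288000, pages=3)
```

Each URL returns many stories per request, as opposed to one request per item with the official API, which is where the speedup comes from.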