Thank you! I was trying to do that but couldn't find information on their policy on accepting and hosting such data. I've added links to it, as well as the torrent links other folks sent.
Just wanted to start a thread on some project ideas for this data:
* Discover geek-friendly WordPress themes and plugins by analyzing the CSS of stories posted on HN.
* My pet EVIL project: extract self-identifying statements from comments and build profiles of HN users :).
* Find out the abandonment rate of veteran users.
* Find undiscovered great stories that never made the front page because of deficiencies in HN's ranking algorithm (for example, get links posted by people with 10K+ karma but without upvotes; see the sketch below).
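Here's a minimal sketch of that last idea, assuming the stories dump is newline-delimited JSON with username/points/url fields (the field names are my guesses; adjust to whatever the dump actually uses) and a karma table fetched separately:

    import json

    KARMA = {}  # username -> karma, prefetched separately (e.g. via the API)

    def overlooked_stories(path, min_karma=10000, max_points=1):
        """Yield stories by high-karma users that got (almost) no upvotes."""
        with open(path) as f:
            for line in f:
                story = json.loads(line)
                user = story.get("username")
                if (KARMA.get(user, 0) >= min_karma
                        and story.get("points", 0) <= max_points):
                    yield story

    for story in overlooked_stories("stories.json"):
        print(story.get("points"), story.get("username"), story.get("url"))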
Yup :) Except I've been doing it manually: as I see comments/commenters that interest me, I add them to a Greasemonkey script that decorates posts with icons identifying users and their affiliations. Does that make me evil?
If I find some free time I might download this and run some of the extracts sytelus suggested.
Fun fact: At 10,000 calls/hour and 1,000 objects per request, you can download all stories AND comments in less than an hour.
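The arithmetic: 10,000 calls x 1,000 objects = 10 million objects per hour, and item ids in this very thread are only around 7.8M, so one hour's quota covers everything. A rough pagination sketch against the Algolia-hosted HN Search API (an assumption on my part about which endpoint you'd use; hitsPerPage caps at 1,000):

    import json, time
    from urllib.parse import urlencode
    from urllib.request import urlopen

    API = "https://hn.algolia.com/api/v1/search_by_date"

    def fetch_all(tag="story"):
        """Page backwards through all items of a given type."""
        before = int(time.time())
        while True:
            qs = urlencode({"tags": tag, "hitsPerPage": 1000,
                            "numericFilters": "created_at_i<%d" % before})
            hits = json.loads(urlopen(API + "?" + qs).read().decode())["hits"]
            if not hits:
                break
            yield from hits
            # Oldest item on this page; ties on created_at_i can drop a few
            # items, which is fine for a sketch.
            before = hits[-1]["created_at_i"]

    for item in fetch_all("story"):
        print(item["objectID"], item.get("title"))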
As an aside: I tried to use ML on Hacker News stories and had exactly zero success (i.e., the predictive models were not statistically significantly better than the no-information rate, NIR).
I have a few other things in the works related to the hnsearch API (had them almost ready for release a few months ago but then I got distracted); this post is persuading me to finish them up soon. :)
I'm quite excited about this data release; there are many interesting ML models that could be trained here. One I hacked on previously, on my own data, was a comment ranker that uses a ranking loss to rank comments consistently with their observed order in the data (which roughly reflects their number of upvotes, I believe). In principle it could be turned into a browser extension that scores how well received your comment is likely to be, conditioned on the parent comment, as you write it in the text box. One of the main issues I ran into was space complexity, since you need to keep all the word embeddings (usually on the order of 50-200 dimensions per word) in the extension's memory, and there are many words.
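For the curious, the core of the idea fits in a few lines. This is a toy sketch of my reading of it (averaged word embeddings, linear scorer, pairwise hinge ranking loss), not the actual code; training pairs would come from sibling comments where one ended up ranked above the other, and the hashing trick on word ids is one way to cap the memory problem mentioned above:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, VOCAB = 50, 10000
    E = rng.normal(scale=0.1, size=(VOCAB, DIM))  # word embeddings (random here)
    w = np.zeros(2 * DIM)                          # linear scoring weights

    def embed(text):
        """Average the embeddings of a text's words (hashed word ids)."""
        ids = [hash(t) % VOCAB for t in text.lower().split()]
        return E[ids].mean(axis=0) if ids else np.zeros(DIM)

    def features(parent, comment):
        return np.concatenate([embed(parent), embed(comment)])

    def score(parent, comment):
        return w @ features(parent, comment)

    def train_pair(parent, better, worse, lr=0.01, margin=1.0):
        """One SGD step on a hinge ranking loss: the higher-ranked comment
        should out-score the lower-ranked one by at least `margin`."""
        global w
        if score(parent, better) - score(parent, worse) < margin:
            w += lr * (features(parent, better) - features(parent, worse))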
I also did topic modeling (LDA) and other experiments on HN; code for fetching all posts and comments, and for the topic modeling, can be found here: https://github.com/SnippyHolloW/HN_stats (lein run // look at hn.py)
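For anyone who wants the flavor of it without reading the repo, here's a minimal LDA sketch with gensim (my own rough version on a toy stand-in corpus, not the repo's actual code):

    from gensim import corpora, models

    comments = ["show hn my weekend project",
                "the startup was acquired last year",
                "python is great for quick scripts"]  # toy stand-in corpus

    docs = [c.lower().split() for c in comments]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=5)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)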
Thanks...on my 1,000th day (which was quite a few days ago), I wanted to analyze how my upvoting/submission habits changed...that is, over the 3+ years I've been on HN, did my interests in hacking and languages diversify? But in terms of what I upvoted, it's hard to tell without the entire post set whether my interests changed or whether the composition of HN's submissions did.
Also, it's great to be able to filter through and find all of the highly upvoted stories that I've missed out on, and programmatically push them to Pinboard. Thanks for this.
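In case it helps anyone, the Pinboard side is nearly a one-liner against its v1 API; the auth token here is a placeholder you'd fill in yourself:

    from urllib.parse import urlencode
    from urllib.request import urlopen

    PINBOARD_TOKEN = "username:XXXXXXXX"  # placeholder, from pinboard.in/settings/password

    def pin(url, title, tags="hackernews"):
        """Save a story to Pinboard."""
        qs = urlencode({"auth_token": PINBOARD_TOKEN, "url": url,
                        "description": title, "tags": tags, "format": "json"})
        return urlopen("https://api.pinboard.in/v1/posts/add?" + qs).read()

    # e.g. for every story in the dump with, say, 100+ points that you
    # never saw, call pin(story["url"], story["title"]).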
Most stories are not spam; they are just posted at the wrong time.
To give you an example, this story itself was posted last Friday evening PST (https://news.ycombinator.com/item?id=7825146). It got just one upvote and 0 comments. The exact same story, with the exact same page content on the exact same domain, was posted on Monday (today) afternoon PST, and it got 80+ upvotes, 30+ comments, and more than 6 hours on the front page!
A lot of stories are like this. HN's ranking algorithm isn't perfect.
Stories that are dead will not show up in the API; that's correct. But the proportion of submissions that make it to the front page or get any comments is very low (<10%).
What licence is the HN content made available under? I'm trying to find some T&Cs that I might have agreed to, but I haven't found them yet. I don't think you can assume that comments made here are free to copy, reuse, manipulate, and republish elsewhere.
Not sure two giant JSON files are the best format for this, but I used jp (http://www.paulhammond.org/jp/) to browse through them.
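If each file is one top-level JSON array, one workaround is to stream it into newline-delimited JSON so grep/jq/jp-style tools can chew on it line by line. A sketch using the third-party ijson streaming parser (assuming that array structure):

    import json
    import ijson  # streaming parser; avoids loading the whole file into memory

    with open("stories.json", "rb") as src, open("stories.ndjson", "w") as dst:
        for obj in ijson.items(src, "item"):  # "item" = each array element
            dst.write(json.dumps(obj, default=str) + "\n")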