Downloading All of Hacker News Posts and Comments (shitalshah.com)
136 points by sytelus on June 2, 2014 | 40 comments



I uploaded it to Internet Archive: https://archive.org/details/HackerNewsStoriesAndCommentsDump

Not sure two giant JSON files is the best format for this, but I used jp: http://www.paulhammond.org/jp/ to browse through it.
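If you'd rather poke at it from Python, you can stream one of the giant files without loading it all into memory. Untested sketch; it assumes each file is a single top-level JSON array with HNSearch-style fields (file name illustrative):

    # Stream a huge JSON array element by element with ijson
    import ijson

    with open('stories.json', 'rb') as f:
        for story in ijson.items(f, 'item'):   # 'item' = each array element
            if (story.get('points') or 0) > 500:
                print(story['points'], story['title'])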


Thank you! I was trying to do that but couldn't find information on their policy about accepting and hosting such data. I've added links to this, as well as the torrent links other folks sent.


Just wanted to start a thread on some project ideas for this data:

* Discover geek-friendly WordPress themes and plugins by analyzing the CSS of stories posted on HN.

* My pet EVIL project: Extract self-identifying statements from comments and create profiles for HN users :).

* Find out the abandonment rate of veteran users.

* Find undiscovered great stories that never made the front page because of deficiencies in HN's ranking algorithm (for example, links posted by people with 10K+ karma but with no upvotes; see the sketch below).
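A rough sketch of that last filter, assuming story objects carry 'author', 'points', 'title', and 'url' fields as in the HNSearch schema; karma_by_user is a hypothetical {username: karma} lookup you'd have to build separately (e.g., by fetching user profiles):

    def overlooked_stories(stories, karma_by_user, min_karma=10000):
        # Yield stories by high-karma authors that never got an upvote
        # (a story starts with the submitter's own point, hence <= 1).
        for s in stories:
            if (karma_by_user.get(s['author'], 0) >= min_karma
                    and s.get('points', 0) <= 1):
                yield s['title'], s.get('url')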


Construct phrases like Yoda, shall I. Unmask this Identity, never you shall.


Oh come on George, stop it.


> My pet EVIL project: Extract self-identifying statements from comments and create profiles for HN users :).

I'd bet that you're not the only one who thought of that. And for less money I'd bet that someone has beat you to it.


Yup :) Except I've been doing it manually: as I see comments/commenters that interest me, I add them to a Greasemonkey script that decorates posts with icons identifying users and their affiliations. Does that make me evil?

If I find some free time I might download this and run some extracts as sytelus suggested.


Awesome, this will be super useful for running ML experiments on the HN stories.

Previously, Max Woolf worked on this: http://minimaxir.com/2014/02/hacking-hacker-news/ (discussed at https://news.ycombinator.com/item?id=7291531)


I did, and I also put the Python code on GitHub; it uses the same approach as the OP: https://github.com/minimaxir/hacker-news-download-all-storie...

Fun fact: At 10,000 calls/hour and 1,000 objects per request, you can download all stories AND comments in less than an hour.
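The arithmetic checks out: 1.3M stories plus 5.8M comments is roughly 7.1M objects, or about 7,100 requests at 1,000 objects each, comfortably inside the 10,000 calls/hour budget. The pagination idea looks roughly like this (untested sketch, paging backwards by creation time through the Algolia HN Search API; see the repo for the real script):

    import time

    import requests

    API = 'https://hn.algolia.com/api/v1/search_by_date'

    def download_all(tag):
        # tag is 'story' or 'comment'; results come back newest-first
        newest = int(time.time())
        while True:
            r = requests.get(API, params={
                'tags': tag,
                'hitsPerPage': 1000,
                'numericFilters': 'created_at_i<%d' % newest,
            })
            hits = r.json()['hits']
            if not hits:
                break
            for h in hits:
                yield h
            newest = hits[-1]['created_at_i']  # step the cursor backwards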

As an aside: I tried to use ML on Hacker News stories and have had exactly zero success (i.e., the predictive models are not statistically significantly better than the no-information rate).


If you prefer the Python approach, I have forked minimaxir's code to work for comments:

https://github.com/jaredsohn/hacker-news-download-all-commen...

I have a few other things in the works related to the hnsearch API (had them almost ready for release a few months ago but then I got distracted); this post is persuading me to finish them up soon. :)


I did a similar analysis here http://karpathy.ca/myblog/?p=559 but with a less polished presentation.

I'm quite excited about this data release: there are many interesting ML models that could be trained here. One I hacked on previously with my own data was a comment ranker, which uses a ranking loss to rank comments consistently with their observed order in the data (which roughly reflects their number of upvotes, I believe). In principle I think it could be converted to a browser extension that scores how well received your comment will be, conditioned on the parent comment, as you write it in the text box. One of the main issues I ran into when I hacked on it a bit was space complexity, since you need to keep all the word embeddings (usually on the order of 50-200 dimensions per word) in the extension's memory, and there are many words.
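For the curious, the core of a ranking loss is tiny. A toy perceptron-style version with a hinge margin, using whatever feature vectors you like (bag-of-words here instead of embeddings; all names illustrative):

    import numpy as np

    def train_ranker(pairs, dim, lr=0.01, epochs=10):
        # pairs: (x_higher, x_lower) numpy feature vectors for sibling
        # comments where x_higher was observed ranked above x_lower
        w = np.zeros(dim)
        for _ in range(epochs):
            for hi, lo in pairs:
                if w @ hi - w @ lo < 1.0:   # margin violated
                    w += lr * (hi - lo)     # push the scores apart
        return w

Scoring a draft comment is then just a dot product with its features, with the parent comment folded into the feature vector however you like.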


I also did topic modeling (LDA) and other experiments on HN. Code for fetching all posts and comments, and for the topic modeling, can be found here: https://github.com/SnippyHolloW/HN_stats (lein run // look at hn.py)
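If you want a pure-Python starting point, gensim gets you an LDA baseline in a few lines (minimal sketch; comment_texts stands in for whatever you parse out of the dump):

    from gensim import corpora, models

    # Tokenize crudely; real preprocessing would strip HTML and stopwords
    docs = [text.lower().split() for text in comment_texts]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary)
    print(lda.print_topics(5))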


What were your results? What did you do it for?


Have you considered sharing the data as a torrent? The current FileDropper speed is quite low (~200KB/s)

P.S. Thanks for the data, you saved a good amount of our time.


magnet:?xt=urn:btih:A3E2200A9A99906476E2E88CA002477219A3C2C3&dn=HN&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.publicbt.com%3a80%2fannounce&tr=udp%3a%2f%2ftracker.ccc.de%3a80%2fannounce


PLEASE add four spaces before very long unbroken strings.


999.9 megabytes?


954 MB


Danke...on my 1,000th day (which was quite a few days ago), I wanted to do an analysis of how my upvoting/submission habits changed...that is, over the 3+ years I've been on HN, did my interests in hacking and languages diversify? But in terms of what I upvoted, it's hard to tell without the entire post set whether my interests changed or whether the composition of HN's submissions did.

Also, it's great to be able to filter through and find all of the highly upvoted stories that I've missed out on, and programmatically push them to Pinboard. Thanks for this.


Your "Stories Download URL" is the same as the "Comments Download URL".


Fixed. Thanks for pointing this out.


Easy fix: just change 'comments' in the URL to 'stories'.


I imagine the point of that comment was to suggest that the OP fix it to make it easier for others.


> Content Blocked (content_filter_denied)

> Content Category: "Suspicious"

Which is an odd catch-all category. It may be a keyword match on the domain. That sucks. Anyone got the GitHub link?



It's strange to see 1.3m stories and only 5.8m comments. I look forward to examining the data to see how many stories have 0 comments.
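Once both files are parsed, that count is a couple of lines (sketch; assumes comments carry a 'story_id' and stories an 'objectID' as in the HNSearch schema, coerced to a common type):

    from collections import Counter

    per_story = Counter(int(c['story_id']) for c in comments)
    zero = sum(1 for s in stories if int(s['objectID']) not in per_story)
    print(zero, 'stories with zero comments')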


Lots and lots of stories never make it anywhere near the front page. Mostly spam.


Most stories are not spam; they are just posted at the wrong time.

To give you an example, this story itself was posted last Friday evening PST (https://news.ycombinator.com/item?id=7825146). It got just one upvote and zero comments. The exact same story, with the exact same page content on the exact same domain, was posted Monday (today) afternoon PST, and it got 80+ upvotes, 30+ comments, and more than 6 hours on the front page!

A lot of stories are like this. The HN ranking algorithm isn't perfect.


"Mostly spam."

No. Most stories in "new" get zero comments and are not spam.


But I think the spam is caught and doesn't register through the API...


Stories which are dead will not show in the API; that's correct. But the proportion of submissions that make it to the front page / get any comments is very low (<10%).


[deleted]


I'm seeding those, plus the combined 7z file that's linked in the same GitHub issue.

Edit: scratch that, can't get any peers/DHT response - can you check if you're announcing that torrent at all?


Seems redundant, as you can already choose what file you want in the magnet link I posted. This only harms both swarms.


Thanks for the dataset! I've been wanting this corpus for a while. I have a few ideas I want to run with this.


What licence is the HN content made available under? I'm trying to find some T&Cs that I might have agreed to, but I haven't found them yet. I don't think you can assume that comments made here are free to copy, reuse, manipulate, and republish elsewhere.


Great work! Does anyone know if there is anything similar for Slashdot?


The /. comments, with their tags "Interesting", "Funny", "Insightful", "Informative", "Flamebait", can be very helpful for machine learning purposes.


Very helpful. I was looking for a dataset like this recently, and now here it is. Great work.


Your "Fork me on Github" banner covers your hamburger menu icon.


Would be nice if there were something similar for Reddit.



