Thank you! I was trying to do that but couldn't find information on their policy on accepting and hosting such data. I've added links to it, as well as the torrent links other folks sent.
Just wanted to start a thread on some project ideas for this data:
* Discover geek-friendly WordPress themes and plugins by analyzing the CSS of stories posted on HN.
* My pet EVIL project: extract self-identifying statements from comments and build profiles of HN users :).
* Find out the abandonment rate of veteran users.
* Find undiscovered great stories that never made the front page because of deficiencies in HN's ranking algorithm (for example, get links posted by people with 10K+ karma but without upvotes; see the sketch below).
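Here's a minimal sketch of that last idea, assuming the stories dump is newline-delimited JSON with username/points/url fields (the field names are my guesses; adjust to whatever the dump actually uses) and a karma table fetched separately:

    import json

    KARMA = {}  # username -> karma, prefetched separately (e.g. via the API)

    def overlooked_stories(path, min_karma=10000, max_points=1):
        """Yield stories by high-karma users that got (almost) no upvotes."""
        with open(path) as f:
            for line in f:
                story = json.loads(line)
                user = story.get("username")
                if (KARMA.get(user, 0) >= min_karma
                        and story.get("points", 0) <= max_points):
                    yield story

    for story in overlooked_stories("stories.json"):
        print(story.get("points"), story.get("username"), story.get("url"))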
Yup :) Except I've been doing it manually: as I see comments/commenters that interest me, I add them to a Greasemonkey script that decorates posts with icons identifying users and their affiliations. Does that make me evil?
If I find some free time I might download this and run some of the extracts sytelus suggested.
Fun fact: At 10,000 calls/hour and 1,000 objects per request, you can download all stories AND comments in less than an hour.
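The arithmetic: 10,000 calls x 1,000 objects = 10 million objects per hour, and item ids in this very thread are only around 7.8M, so one hour's quota covers everything. A rough pagination sketch against the Algolia-hosted HN Search API (an assumption on my part about which endpoint you'd use; hitsPerPage caps at 1,000):

    import json, time
    from urllib.parse import urlencode
    from urllib.request import urlopen

    API = "https://hn.algolia.com/api/v1/search_by_date"

    def fetch_all(tag="story"):
        """Page backwards through all items of a given type."""
        before = int(time.time())
        while True:
            qs = urlencode({"tags": tag, "hitsPerPage": 1000,
                            "numericFilters": "created_at_i<%d" % before})
            hits = json.loads(urlopen(API + "?" + qs).read().decode())["hits"]
            if not hits:
                break
            yield from hits
            # Oldest item on this page; ties on created_at_i can drop a few
            # items, which is fine for a sketch.
            before = hits[-1]["created_at_i"]

    for item in fetch_all("story"):
        print(item["objectID"], item.get("title"))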
As an aside: I tried to use ML on Hacker News stories and had exactly zero success (i.e., the predictive models were not statistically significantly better than the no-information rate, NIR).
I have a few other things in the works related to the hnsearch API (had them almost ready for release a few months ago but then I got distracted); this post is persuading me to finish them up soon. :)
I'm quite excited about this data release; there are many interesting ML models that could be trained here. One I hacked on previously, on my own data, was a comment ranker that uses a ranking loss to rank comments consistently with their observed order in the data (which roughly reflects their number of upvotes, I believe). In principle it could be turned into a browser extension that scores how well received your comment is likely to be, conditioned on the parent comment, as you write it in the text box. One of the main issues I ran into was space complexity, since you need to keep all the word embeddings (usually on the order of 50-200 dimensions per word) in the extension's memory, and there are many words.
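For the curious, the core of the idea fits in a few lines. This is a toy sketch of my reading of it (averaged word embeddings, linear scorer, pairwise hinge ranking loss), not the actual code; training pairs would come from sibling comments where one ended up ranked above the other, and the hashing trick on word ids is one way to cap the memory problem mentioned above:

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, VOCAB = 50, 10000
    E = rng.normal(scale=0.1, size=(VOCAB, DIM))  # word embeddings (random here)
    w = np.zeros(2 * DIM)                          # linear scoring weights

    def embed(text):
        """Average the embeddings of a text's words (hashed word ids)."""
        ids = [hash(t) % VOCAB for t in text.lower().split()]
        return E[ids].mean(axis=0) if ids else np.zeros(DIM)

    def features(parent, comment):
        return np.concatenate([embed(parent), embed(comment)])

    def score(parent, comment):
        return w @ features(parent, comment)

    def train_pair(parent, better, worse, lr=0.01, margin=1.0):
        """One SGD step on a hinge ranking loss: the higher-ranked comment
        should out-score the lower-ranked one by at least `margin`."""
        global w
        if score(parent, better) - score(parent, worse) < margin:
            w += lr * (features(parent, better) - features(parent, worse))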
I also did topic modeling (LDA) and other experiments on HN; code for fetching all posts and comments, and for the topic modeling, can be found here: https://github.com/SnippyHolloW/HN_stats (lein run // look at hn.py)
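For anyone who wants the flavor of it without reading the repo, here's a minimal LDA sketch with gensim (my own rough version on a toy stand-in corpus, not the repo's actual code):

    from gensim import corpora, models

    comments = ["show hn my weekend project",
                "the startup was acquired last year",
                "python is great for quick scripts"]  # toy stand-in corpus

    docs = [c.lower().split() for c in comments]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=5)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)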
Thanks...on my 1,000th day (which was quite a few days ago), I wanted to analyze how my upvoting/submission habits changed...that is, over the 3+ years I've been on HN, did my interests in hacking and languages diversify? But in terms of what I upvoted, it's hard to tell without the entire post set whether my interests changed or whether the composition of HN's submissions did.
Also, it's great to be able to filter through and find all of the highly upvoted stories that I've missed out on, and programmatically push them to Pinboard. Thanks for this.
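In case it helps anyone, the Pinboard side is nearly a one-liner against its v1 API; the auth token here is a placeholder you'd fill in yourself:

    from urllib.parse import urlencode
    from urllib.request import urlopen

    PINBOARD_TOKEN = "username:XXXXXXXX"  # placeholder, from pinboard.in/settings/password

    def pin(url, title, tags="hackernews"):
        """Save a story to Pinboard."""
        qs = urlencode({"auth_token": PINBOARD_TOKEN, "url": url,
                        "description": title, "tags": tags, "format": "json"})
        return urlopen("https://api.pinboard.in/v1/posts/add?" + qs).read()

    # e.g. for every story in the dump with, say, 100+ points that you
    # never saw, call pin(story["url"], story["title"]).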
Most stories are not spam; they are just posted at the wrong time.
To give you an example, this story itself was posted last Friday evening PST (https://news.ycombinator.com/item?id=7825146). It got just one upvote and 0 comments. The exact same story, with the exact same page content on the exact same domain, was posted on Monday (today) afternoon PST, and it got 80+ upvotes, 30+ comments, and more than 6 hours on the front page!
A lot of stories are like this. HN's ranking algorithm isn't perfect.
Stories that are dead will not show up in the API; that's correct. But the proportion of submissions that make it to the front page or get any comments is very low (<10%).
What licence is the HN content made available under? I'm trying to find some T&Cs that I might have agreed to, but I haven't found them yet. I don't think you can assume that comments made here are free to copy, reuse, manipulate, and republish elsewhere.
Not sure two giant JSON files are the best format for this, but I used jp (http://www.paulhammond.org/jp/) to browse through them.
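If each file is one top-level JSON array, one workaround is to stream it into newline-delimited JSON so grep/jq/jp-style tools can chew on it line by line. A sketch using the third-party ijson streaming parser (assuming that array structure):

    import json
    import ijson  # streaming parser; avoids loading the whole file into memory

    with open("stories.json", "rb") as src, open("stories.ndjson", "w") as dst:
        for obj in ijson.items(src, "item"):  # "item" = each array element
            dst.write(json.dumps(obj, default=str) + "\n")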