For what it's worth: a scatter plot with lots of huge points, like the one you've drawn for "upvotes vs. comments", is pretty useless for drawing conclusions about the data. It tells you about the support of the joint distribution (the region on which it's non-zero) but very little about its shape.
In particular, that graph could represent a fairly strong correlation (in the R^2 sense), or a fairly weak one, or anything in between. If you want to say something more quantitative about the data, you can do a linear regression and look at the coefficients and residuals.
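To make that concrete, here's roughly what I mean (a throwaway sketch with made-up numbers, not your data): fit comments against upvotes with ordinary least squares and look at the slope, the residuals and R^2.

    import numpy as np

    # Made-up per-post counts, purely for illustration
    upvotes = np.array([120.0, 45.0, 300.0, 12.0, 87.0])
    comments = np.array([40.0, 10.0, 95.0, 3.0, 30.0])

    # Ordinary least squares: comments ~ slope * upvotes + intercept
    slope, intercept = np.polyfit(upvotes, comments, 1)
    residuals = comments - (slope * upvotes + intercept)

    # R^2 = 1 - SS_res / SS_tot
    r_squared = 1 - (residuals ** 2).sum() / ((comments - comments.mean()) ** 2).sum()
    print("slope=%.3f intercept=%.3f R^2=%.3f" % (slope, intercept, r_squared))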
The correlation between individual upvotes and comments isn't really what the post is about; it's purely an illustration and has no impact on the topic extraction or interpretation.
For what it's worth, I did check the correlation coefficient between the two sets (it's 0.81).
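(If anyone wants to reproduce that kind of check, it's essentially a one-liner with numpy; the numbers below are placeholders, not the actual per-topic totals.)

    import numpy as np

    # Placeholder per-topic totals, just to show the call
    topic_upvotes = np.array([1500.0, 820.0, 2300.0, 400.0])
    topic_comments = np.array([510.0, 250.0, 900.0, 90.0])

    r = np.corrcoef(topic_upvotes, topic_comments)[0, 1]
    print("Pearson r = %.2f" % r)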
Some of the categories are quite surprising, for example:
space-nasa-tesla-rocket-launch-star-CHINA-nuclear as space;
ruby not as programming;
com-http-www-EMACS-LIST-org-book-pdf as junk;
Those seem very specific, or just plain wrongly classified. Maybe you can show them individually, just as a big dump of generated graphs? No need to make another post about it, but I'd like to see them in context. Thanks!
I think you're mixing up what topics are. The actual topics as generated by LDA are the concatenated word lists (actually distributions over all words in the corpus, of which I concatenate the top 8 words to generate a meaningful descriptor of the topic). So server-client-http-request-service-ruby-connection-user is one topic / word distribution, in which "ruby" happens to be the 6th most probable word, likely because it appears a lot in posts on servers, web services etc. It does not mean the word "ruby" itself is classified as server-related. The same applies to the other examples you gave.
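To sketch what I mean by a descriptor (not my exact code; lda here stands for any trained gensim LDA model, and recent gensim versions return (word, probability) pairs from show_topic):

    def topic_descriptor(lda, topic_id, topn=8):
        # Take the topn most probable words of the topic's word distribution
        # and join them into a human-readable label.
        top_words = [word for word, prob in lda.show_topic(topic_id, topn=topn)]
        return "-".join(top_words)

    # e.g. topic_descriptor(lda, 3) could come out as
    # "server-client-http-request-service-ruby-connection-user"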
The categories/domains I simply assigned manually, to show how one could possibly interpret these word distributions that LDA generated.
Not sure what you mean by a new classification approach. There is no classification here, since there are no labeled documents. This is purely unsupervised topic modelling. The topics are mathematical objects. How they are later named or grouped for better human readability is a subjective matter.
I used a similar technology stack for categorizing bookmarks (boilerpipe + gensim lda). Interesting that we wound up choosing the same tools.
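In case it's useful to anyone, the rough shape of that pipeline (a sketch with placeholder URLs and parameters, using the python-boilerpipe wrapper and gensim; not my exact code):

    from boilerpipe.extract import Extractor
    from gensim import corpora, models

    urls = ["http://example.com/a", "http://example.com/b"]  # placeholders

    # Strip boilerplate from each page and tokenize very crudely
    texts = [Extractor(extractor="ArticleExtractor", url=u).getText().lower().split()
             for u in urls]

    # Bag-of-words corpus for gensim
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # Train LDA; the number of topics is a parameter you have to pick up front
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=30, passes=10)

    for topic in lda.show_topics(num_topics=5):
        print(topic)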
In the interest of reporting on failed experiments: I also tried a k-means analysis written in PHP. It was slow and worthless; I wouldn't recommend anyone else go down that road.
In terms of next steps, I've been trying to use the open source HLDA software from David M. Blei's group [0] to do hierarchical clustering, which avoids having to pick the number of topics as a parameter. I haven't gotten it to compile on my machine yet, though.
Now I'm conscious of skewing the data by posting this comment...
I don't think the API exposes it, but the aggregate upvotes of the comments on a post might be interesting. It's one thing to have a lot of comments, but some measure of the quality of the discourse would be worth knowing.
The topic "language-type-program-code" is the 6th ranked topic out of 30 in terms of comments, so it's pretty high. Considering the error bars, it could possibly be even further up.