Which topics get the upvote on Hacker News? (datadive.net)
110 points by datalink on Feb 13, 2015 | 16 comments



For what it's worth: a scatter plot with lots of huge points, like the one you've drawn for "upvotes vs. comments", is pretty useless for drawing conclusions about the data. It tells you about the support of the joint distribution (the region on which it's non-zero) but very little about its shape.

In particular, that graph could represent a fairly strong correlation (in the R^2 sense), or a fairly weak one, or anything in between. If you want to say something more quantitative about the data, you can do a linear regression and look at the coefficients and residuals.
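
For example, a rough sketch in Python (toy data standing in for the real per-post counts):

  import numpy as np
  from scipy import stats

  # Toy stand-ins for per-post (upvotes, comments) pairs
  np.random.seed(0)
  upvotes = np.random.poisson(50, 1000)
  comments = (0.5 * upvotes + np.random.normal(0, 5, 1000)).clip(min=0)

  # Slope/intercept quantify the relationship; R^2 says how much of it the line explains
  fit = stats.linregress(upvotes, comments)
  print("slope: %.2f  intercept: %.2f  R^2: %.2f" % (fit.slope, fit.intercept, fit.rvalue ** 2))

  # Residuals show how far individual posts deviate from the fitted line
  residuals = comments - (fit.slope * upvotes + fit.intercept)
  print("residual std: %.2f" % residuals.std())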


The correlation between individual upvotes and comments isn't really what the post is about; it's purely an illustration and has no impact on the topic extraction or interpretation. For what it's worth, I did check the correlation coefficient between the two sets (it's 0.81).


When there's too much data for a scatter plot, a heat map will do nicely.
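
E.g. with matplotlib (a minimal sketch; the toy arrays stand in for the real per-post counts):

  import numpy as np
  import matplotlib.pyplot as plt

  # Toy stand-ins for the per-post upvote/comment counts
  np.random.seed(0)
  upvotes = np.random.poisson(50, 5000)
  comments = np.random.poisson(25, 5000)

  # 2D histogram: colour shows how many posts fall into each bin
  plt.hist2d(upvotes, comments, bins=50)
  plt.colorbar(label="posts per bin")
  plt.xlabel("upvotes")
  plt.ylabel("comments")
  plt.show()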


Or pass alpha=0.3 to the plotting function.
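
For example (assuming the same upvote/comment arrays as in the sketch above):

  import matplotlib.pyplot as plt

  # Semi-transparent points, so dense regions read as darker instead of one solid blob
  plt.scatter(upvotes, comments, alpha=0.3)
  plt.show()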


Some interesting chains pop up from the data!

  - google-microsoft-windows-video-browser-user-support-chrome
  - security-key-attack-password-hacker-encryption-network-secure
  - language-type-program-code-programmer-java-write-class
  - number-point-algorithm-value-example-result-set-problem
  - space-nasa-tesla-rocket-launch-start-china-nuclear
  - data-database-map-table-analysis-information-graph-model


Some of the categories are quite surprising, for example: space-nasa-tesla-rocket-launch-star-CHINA-nuclear as space; ruby not as programming; com-http-www-EMACS-LIST-org-book-pdf as junk. Those seem very specific, or plain wrongly classified. Maybe you could show them individually, just as a big dump of generated graphs? No need to make another post about it, but I'd like to see them in context. Thanks!


I think you're mixing up what the topics are. The actual topics as generated by LDA are the concatenated word lists (actually distributions over all words in the corpus, of which I concatenate the top 8 words to generate a meaningful descriptor of the topic). So server-client-http-request-service-ruby-connection-user is one topic / word distribution, in which "ruby" happens to be the 6th most probable word, likely because it appears a lot in posts on servers, web services etc. It does not mean the word "ruby" itself is classified as server related. The same applies to the other examples you gave.

The categories/domains I simply assigned manually, to show how one could possibly interpret these word distributions that LDA generated.
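
Roughly, in gensim terms (a toy corpus as a placeholder, not the actual pipeline from the post):

  from gensim import corpora, models

  # Toy tokenized documents standing in for the HN stories
  texts = [["server", "client", "http", "request", "ruby"],
           ["server", "connection", "user", "service", "http"],
           ["space", "nasa", "rocket", "launch", "nuclear"]]

  dictionary = corpora.Dictionary(texts)
  corpus = [dictionary.doc2bow(t) for t in texts]

  # Each LDA topic is a probability distribution over the whole vocabulary
  lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)

  # The dash-joined descriptors are just the 8 most probable words of each topic
  for topic_id in range(lda.num_topics):
      top_words = [word for word, prob in lda.show_topic(topic_id, topn=8)]
      print("-".join(top_words))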


I think you might want a new classification approach.


Not sure what you mean by a new classification approach. There is no classification here, since there are no labeled documents. This is purely unsupervised topic modelling. The topics are mathematical objects. How they are later named or grouped for better human readability is a subjective matter.


I used a similar technology stack for categorizing bookmarks (boilerpipe + gensim lda). Interesting that we wound up choosing the same tools.

In the interest of reporting on failed experiments, I also tried a k-means analysis written in PHP. It was slow and worthless; I wouldn't recommend anyone else go down that road.

In terms of next steps, I've been trying to use the open-source hLDA software from David M. Blei's group [0] to do hierarchical clustering, to avoid having to decide on the number of topics up front. I haven't gotten it to compile on my machine yet, though.

[0] http://www.cs.princeton.edu/~blei/topicmodeling.html
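
In the meantime, gensim's own HDP implementation might be worth a try as a stopgap; it infers the number of topics from the data rather than taking it as a parameter. A rough sketch (toy corpus as a placeholder):

  from gensim import corpora, models

  # Toy tokenized documents standing in for the extracted bookmark text
  texts = [["python", "code", "function", "class"],
           ["server", "http", "request", "client"],
           ["space", "rocket", "launch", "nasa"]]

  dictionary = corpora.Dictionary(texts)
  corpus = [dictionary.doc2bow(t) for t in texts]

  # HDP is nonparametric: the number of topics is inferred rather than fixed in advance
  hdp = models.HdpModel(corpus, id2word=dictionary)
  for topic in hdp.show_topics():
      print(topic)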


Great analysis. You got me interested in the particular Java libraries you used for content analysis.

So according to your analysis, this particular post won't be getting too many comments :P


Now I'm conscious of skewing the data by posting this comment...

I don't think the API exposes it, but the aggregate upvotes of the comments on a post might be interesting. It's one thing to have a lot of comments, but measuring the quality of the discourse would be worth knowing.


> So according to your analysis, this particular post won't be getting too many comments :P

Unless you take into account an unpredictable bias that will increase the number of comments.


Could use some basic topic labels like "math" and "history" and "women's issues" (or "women in STEM" or something) and "culture."


I'm surprised language wars didn't show up. It seems to me "X is better than Y" threads always get a lot of heat and light.


The topic "language-type-program-code" is the 6th ranked topic out of 30 in terms of comments, so it's pretty high. Considering the error bars, it could possibly be even further up.



