For what it's worth: a scatter plot with lots of huge points, like the one you've drawn for "upvotes vs. comments", is pretty useless for drawing conclusions about the data. It tells you about the support of the joint distribution (the region on which it's non-zero) but very little about its shape.
In particular, that graph could represent a fairly strong correlation (in the R^2 sense), or a fairly weak one, or anything in between. If you want to say something more quantitative about the data, you can do a linear regression and look at the coefficients and residuals.
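To make that concrete, here's roughly what I mean (a throwaway sketch with made-up numbers, not your data): fit comments against upvotes with ordinary least squares and look at the slope, the residuals and R^2.

    import numpy as np

    # Made-up per-post counts, purely for illustration
    upvotes = np.array([120.0, 45.0, 300.0, 12.0, 87.0])
    comments = np.array([40.0, 10.0, 95.0, 3.0, 30.0])

    # Ordinary least squares: comments ~ slope * upvotes + intercept
    slope, intercept = np.polyfit(upvotes, comments, 1)
    residuals = comments - (slope * upvotes + intercept)

    # R^2 = 1 - SS_res / SS_tot
    r_squared = 1 - (residuals ** 2).sum() / ((comments - comments.mean()) ** 2).sum()
    print("slope=%.3f intercept=%.3f R^2=%.3f" % (slope, intercept, r_squared))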
The correlation between individual upvotes and comments isn't really what the post is about; it's purely an illustration and has no impact on the topic extraction or interpretation.
For what it's worth, I did check the correlation coefficient between the two sets (it's 0.81).
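(If anyone wants to reproduce that kind of check, it's essentially a one-liner with numpy; the numbers below are placeholders, not the actual per-topic totals.)

    import numpy as np

    # Placeholder per-topic totals, just to show the call
    topic_upvotes = np.array([1500.0, 820.0, 2300.0, 400.0])
    topic_comments = np.array([510.0, 250.0, 900.0, 90.0])

    r = np.corrcoef(topic_upvotes, topic_comments)[0, 1]
    print("Pearson r = %.2f" % r)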
Some of the categories are quite surprising, for example:
space-nasa-tesla-rocket-launch-star-CHINA-nuclear as space;
ruby not as programming;
com-http-www-EMACS-LIST-org-book-pdf as junk;
Those seem very specific, or just plain wrongly classified. Maybe you can show them individually, just as a big dump of generated graphs? No need to make another post about it, but I'd like to see them in context. Thanks!
I think you're mixing up what topics are. The actual topics as generated by LDA are the concatenated word lists (actually distributions over all words in the corpus, of which I concatenate the top 8 words to generate a meaningful descriptor of the topic). So server-client-http-request-service-ruby-connection-user is one topic / word distribution, in which "ruby" happens to be the 6th most probable word, likely because it appears a lot in posts on servers, web services etc. It does not mean the word "ruby" itself is classified as server-related. The same applies to the other examples you gave.
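To sketch what I mean by a descriptor (not my exact code; lda here stands for any trained gensim LDA model, and recent gensim versions return (word, probability) pairs from show_topic):

    def topic_descriptor(lda, topic_id, topn=8):
        # Take the topn most probable words of the topic's word distribution
        # and join them into a human-readable label.
        top_words = [word for word, prob in lda.show_topic(topic_id, topn=topn)]
        return "-".join(top_words)

    # e.g. topic_descriptor(lda, 3) could come out as
    # "server-client-http-request-service-ruby-connection-user"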
The categories/domains I simply assigned manually, to show how one could possibly interpret these word distributions that LDA generated.
Not sure what you mean by a new classification approach. There is no classification here, since there are no labeled documents. This is purely unsupervised topic modelling. The topics are mathematical objects. How they are later named or grouped for better human readability is a subjective matter.
I used a similar technology stack for categorizing bookmarks (boilerpipe + gensim lda). Interesting that we wound up choosing the same tools.
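In case it's useful to anyone, the rough shape of that pipeline (a sketch with placeholder URLs and parameters, using the python-boilerpipe wrapper and gensim; not my exact code):

    from boilerpipe.extract import Extractor
    from gensim import corpora, models

    urls = ["http://example.com/a", "http://example.com/b"]  # placeholders

    # Strip boilerplate from each page and tokenize very crudely
    texts = [Extractor(extractor="ArticleExtractor", url=u).getText().lower().split()
             for u in urls]

    # Bag-of-words corpus for gensim
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # Train LDA; the number of topics is a parameter you have to pick up front
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=30, passes=10)

    for topic in lda.show_topics(num_topics=5):
        print(topic)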
In the interest of reporting on failed experiments: I also tried a k-means analysis written in PHP. It was slow and worthless; I wouldn't recommend anyone else go down that road.
In terms of next steps, I've been trying to use the open source HLDA software from David M. Blei's group [0] to do hierarchical clustering, which avoids having to pick the number of topics as a parameter. I haven't gotten it to compile on my machine yet, though.
Now I'm conscious of skewing the data by posting this comment...
I don't think the API exposes it, but the aggregate upvotes of the comments on a post might be interesting. It's one thing to have a lot of comments, but some measure of the quality of the discourse would be worth knowing.
The topic "language-type-program-code" is the 6th ranked topic out of 30 in terms of comments, so it's pretty high. Considering the error bars, it could possibly be even further up.