Show HN: Analyzing top HN posts with language models
117 points by jayalammar on June 10, 2022 | 43 comments
Hi HN,

I spent a few weeks looking at the top HN posts of all time. This included exploration, clustering, creating visualizations, and zooming in on what (to me personally) seems like some of the best discussions on here.

Three things in this post:

1- The interesting groups of HN posts

2- The interactive visualizations that you can explore in your browser

3- The data from this exploration -- this includes a CSV of the titles as well as the text embeddings of 3,000 Ask HN posts.

Blog post about this whole process here: [1]

============

1- The interesting groups of HN posts

From the exploration, Ask HN proved the most interesting. These are the top four groups of topics I found insightful. Each group contains about 400 posts.

- Life experiences and advice threads [2]

- Technical and personal development [3]

- Software career insights, advice, and discussions [4]

- General content recommendations (blogs/podcasts) [5]

============

2- The interactive visualizations that you can explore in your browser

- Top 10,000 Hacker News articles of all time [6]

- Top 3,000 posts in Ask HN [7]

============

3- The data from this exploration

CSV file of top 3K Ask HN posts: [8]

The sentence embeddings of the titles of those posts: [9]

This is a colab notebook containing the code examples (including loading these two data files): [10]
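As a rough sketch of what loading the two files looks like (the Colab notebook above is the authoritative version), the idea is a CSV of titles paired row-for-row with an embedding matrix. The column names and a "title"/"points" layout here are assumptions, and synthetic stand-ins replace the real files so the snippet is self-contained:

```python
# Minimal sketch: a titles CSV plus a row-aligned embedding matrix.
# The real data lives at the Cohere links above; synthetic stand-ins here.
import io

import numpy as np
import pandas as pd

# Stand-in for pd.read_csv("<the linked CSV>") -- column names are assumptions.
csv_text = "title,points\nAsk HN: How do you learn?,1200\nAsk HN: Best podcasts?,900\n"
titles = pd.read_csv(io.StringIO(csv_text))

# Stand-in for np.load("<the linked embeddings file>"); the post's embeddings
# are 1024-dimensional (see the UMAP discussion below).
embeddings = np.random.default_rng(0).normal(size=(len(titles), 1024))

# One embedding vector per title, aligned by row index.
assert len(titles) == embeddings.shape[0]
print(titles.shape, embeddings.shape)
```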

============

If you've ever wanted to get into language models, this is a good place to start. Happy to answer any questions.




The conflict of interest here concerns me. I don't object to content marketing, but I'd rather a) you were clear from the start that you work for this company and are promoting its product, and b) you mentioned that this "revolves around [...] using Cohere’s Embed endpoint", so that people can judge how much they want to "get into language models" with pay-per-character pricing, as opposed to something more open.


Do you advocate for the disclaimer only because 1) the sample uses their product or 2) just because they sell a product correlated to the topic?

I see a lot of articles that fall into #2 being published here without a disclaimer. And I think a disclaimer isn't necessary for #2. Even for #1 I wouldn't bother, but I understand the expectation.

Many advocate a lot against ads, targeting, etc. If we also advocate against promotional content, what would companies do to get attention and traffic?


Most content marketing goes to a post on the company's own website, which is a sort of inherent disclosure.

I don't think asking people to disclose that they work for the company whose product they're promoting is "advocating against promotional content".


That's true, it's not advocating against it; that was a bad way of expressing it.

But I do think it may undercut the power of content marketing by introducing an unconscious and unjustified negative bias.

If a stranger shares it on HN, should I judge it in a different way? What is the true value of highlighting that the author works for the company?


Do you have a better way of distinguishing between (1) somebody sharing a product because they genuinely think it is good and (2) somebody sharing a product because they have a profit motive, despite the product being sh*t?

Proper disclosure is a means to expose bias and external motive. It's not perfect but it's arguably the best method we have right now. There's a reason it's written into law [1]

[1]: https://www.ftc.gov/business-guidance/resources/disclosures-...


I think the link you shared is about a potentially separate issue.

First, it's advertising in the form of paid sponsorship. Second, people have reasons to trust what a social influencer says.

I'm doing neither if I have a business or work in a business and I share here an article that I wrote that happens to promote this business.

I'm not paying a stranger to talk about the business, and you don't have any reason to trust what I say without the usual judgement you'd employ towards a stranger.


There is still some trust I have with a stranger that posts on hackernews. Perhaps misplaced, but it's there nonetheless. I trust that if they took the time to post about a product, that they genuinely believe in it. If I later learn that they are just a marketer for that product, that trust will get diminished.

To give another example, why do we trust product reviews by strangers on Amazon? They could all be just marketers in disguise after all. But Amazon actually puts effort into removing fake reviews. Why? Because honesty matters. And when you have a bias or profit motive, it's important to disclose that part too, so people can judge your review appropriately. Paid reviewers exist, but they are usually disclosed.


But Amazon is full of fake reviews, and fake product listings, and shit... because it makes Amazon money.


And the more fake reviews, the less I trust them. Likewise, the more people here advertising their products without proper disclosure, the less I'll trust the honesty of the posts. So IMO it's in the interest of the community to encourage proper disclosure, so that a certain level of trust is maintained.


I don't understand why you think asking that conflicts of interest be clearly disclosed is me wanting to "advocate against promotional content". Do you believe content marketing only works if it's quietly manipulative? That seems like a pretty grim take.


True, a bad way of expressing it.

Correcting myself: what I mean is that adding a disclaimer may make people look at the content with suspicion, or even walk away without giving it a fair chance.

Whether I received a note from a stranger or from the company employee, it shouldn't change how I judge and evaluate it.

Yet, I think the latter introduces bias and undercuts what could be an honest marketing effort.

This is all speculative. No data-driven backing.


If telling the truth about the origin of content created with intent of manipulating them makes readers look at it with suspicion, why is that a bad thing?

I am just gobsmacked that you think an "honest marketing effort" requires being less than honest about the fact that it's a marketing effort.


Thanks. I just added a disclosure to the comment (can't edit the parent anymore). The full embeddings are freely provided here without the need to use the service.


Thanks! I appreciate it.


"Show HN"s regularly have commercial models behind them. Not sure what your stink is...this is normal.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


"Normal" doesn't mean good, so that line of argument doesn't work for me.

But the specific "stink" I'm objecting to here is, as I said, a conflict of interest. Specifically, saying "If you've ever wanted to get into language models, this is a good place to start" purports to be neutral and helpful, when in fact this person is promoting a product. Maybe using a (to my eyes expensive) commercial service truly is the best place to start learning that. Maybe this truly is the best service to learn it with. But we can't expect a fair answer to those questions from a person who works at the company and whose apparent job is promoting the product that pays their salary.



Disclosure: These were made with Cohere's embeddings; Cohere is the company I work for. The process should work on text embeddings from other sources.


There's too much fixation with "top" in our industry. Top voted tends to mostly be a function of early posting. Later posts don't get votes because they simply were not seen. There seems to be a misreading on a mass scale of what "top" really indicates though; people think it means "quality" when it does not. Study after study, website after website, policy after policy, our online world is built on this fundamental misunderstanding of what is really going on. How do you avoid piling on to this misunderstanding?


Is there a name for this phenomenon so I can google further? Intuitively makes sense, because I’ve seen this before.


Preferential attachment, power-law distributions, and the 80-20 rule are all the same phenomenon.

https://en.m.wikipedia.org/wiki/Preferential_attachment
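The rich-get-richer dynamic behind preferential attachment can be sketched with a toy urn-style simulation (stdlib only; post counts and vote totals are purely illustrative):

```python
# Toy preferential-attachment simulation: each new vote goes to a post with
# probability proportional to the votes it already has, so early leads snowball.
import random

random.seed(42)
votes = [1] * 100                      # 100 posts, one seed vote each
for _ in range(10_000):
    # pick a post weighted by its current vote count (preferential attachment)
    post = random.choices(range(len(votes)), weights=votes)[0]
    votes[post] += 1

votes.sort(reverse=True)
top_20_share = sum(votes[:20]) / sum(votes)
print(f"top 20% of posts hold {top_20_share:.0%} of the votes")
```

Even with identical "quality" (every post starts with one vote), the vote distribution ends up heavily skewed toward a few posts.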


"Already covered"


Interesting, but it doesn't seem like the dimensionality reduction produces a good separation of topics. The UMAP projection looks pretty dense. Did you consider pruning or using something other than embeddings?


So it really depends on what you use for clustering. In this case, I'm clustering on the original embeddings, so the cluster assignments don't necessarily line up with the 2-D UMAP layout. I've also seen:

1- Clustering by UMAP. Here the plot would show clean separation of topics. But the clustering algorithm would be working on highly compressed data (from the 1024 dimensions of the embedding down to the 2 of UMAP).

2- BERTopic's approach of doing UMAP down to 5 dimensions, using this dimensionality for clustering, then UMAP again from 5 to 2. Which is an interesting approach.

I've heard of people getting good results with all three. It's kinda hard to objectively compare, but my leaning was to give the clustering algorithm the representation containing the most information about the text.
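That last option (cluster in the full embedding space, project only for plotting) can be sketched roughly like this. PCA stands in for UMAP so the example only needs scikit-learn (the post itself uses UMAP via the umap-learn package), and random vectors stand in for the real 1024-d title embeddings:

```python
# Sketch: k-means on the full high-dimensional embeddings, then reduce to
# 2-D only for visualization. PCA is a stand-in for UMAP here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 1024))  # stand-in for the real embeddings

# 1. cluster in the original 1024-d space (keeps all the information)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

# 2. project to 2-D purely for plotting (UMAP in the original post)
coords = PCA(n_components=2).fit_transform(embeddings)

print(labels.shape, coords.shape)  # (300,) and (300, 2)
```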


Right, BERTopic's double clustering is interesting. I've also seen people combine that with Louvain instead of k-means.

My intuition was: UMAP itself tries to optimize for 2-D separation in the projection, so we should expect at least some correspondence between the k-means results and the layout in the UMAP plot (except in some pathological edge cases, perhaps).

Nevertheless, nice example and blog post!


TIL Louvain clustering! I've seen it used for graphs. Can it also be used for vectors/points?

Thank you!


You're welcome!

You can actually create a graph by using pairwise similarities (e.g. between each point and its k nearest neighbors) as edge weights. Then you do graph clustering on it. (Using any algorithm, but Louvain is one of the saner ones; clique percolation, Girvan-Newman, etc. all have known problems.)
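A rough sketch of that recipe, reading it as a k-nearest-neighbor similarity graph: connect each point to its k nearest neighbors, weight edges by (shifted) cosine similarity, and run Louvain on the result. Uses networkx's louvain_communities (available since networkx 2.8); random vectors stand in for real embeddings:

```python
# Build a kNN similarity graph from vectors, then cluster it with Louvain.
import numpy as np
from networkx import Graph
from networkx.algorithms.community import louvain_communities

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 32))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors: dot = cosine
sims = X @ X.T

k = 5
G = Graph()
for i in range(len(X)):
    neighbors = np.argsort(sims[i])[::-1][1 : k + 1]  # skip self at rank 0
    for j in neighbors:
        # shift similarity by +1 so edge weights stay positive
        G.add_edge(i, int(j), weight=float(1.0 + sims[i, j]))

communities = louvain_communities(G, weight="weight", seed=0)
print(len(communities), "communities")
```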


Try t-SNE. I used to scoff at cluster plots until I saw those but with t-SNE… wow, those clusters are actually separated!


Are you sure t-SNE and UMAP actually perform very differently? Last I looked, they were somewhat comparable.

[edit]: Seems they are similar for some purposes: https://blog.bioturing.com/2022/01/14/umap-vs-t-sne-single-c...

Also interesting: RAPIDS has a CUDA-accelerated version of UMAP that is very fast (HDBSCAN as well, BTW).


All these dimension reduction methods are extremely similar. The math essentially just preserves nearest neighbors, with a setting for how 'tight' you want the clusters to be.

Check out this image [1] and accompanying paper [2] for further reference

[1] https://www.semanticscholar.org/paper/A-Unifying-Perspective...

[2] https://arxiv.org/abs/2007.08902
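The "tightness" knob mentioned above is perplexity in t-SNE (n_neighbors in UMAP): lower values preserve only very local neighborhoods and tend to give tighter, more fragmented clusters. A small sketch with scikit-learn's t-SNE on synthetic blobs standing in for real embeddings:

```python
# Same data, two perplexity settings: the layouts differ, the shapes don't.
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, y = make_blobs(n_samples=150, n_features=32, centers=3, random_state=0)

for perplexity in (5, 30):
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=0).fit_transform(X)
    print(perplexity, coords.shape)  # both runs produce (150, 2)
```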


Nice idea and analysis! I reproduced it as well with https://graphext.com and got similar clusters https://drive.google.com/file/d/1-kXsKezu2_S07rQn-0bjbHuUXHE...


BTW, there is an implicit recency bias in the dataset: since 2017, top-3K posts have become more frequent, and the average score gets larger year after year as the HN community grows:

- Number of top 3K per month of publishing - https://drive.google.com/file/d/1beAPP9ijruMUs5DN5wOVsBArvxP...

- Avg score of top 3K per month of publishing - https://drive.google.com/file/d/10nSIgH1a6DN6XrDU2DyMJTCgsIg...
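The two plots above boil down to a per-month groupby over the posts' publication dates. A sketch with pandas, where the column names ("created_at", "score") and the tiny synthetic data are assumptions, not the real file's schema:

```python
# Count top-ranked posts and average their scores per month of publishing.
import pandas as pd

posts = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2016-03-01", "2016-03-15", "2017-06-02", "2018-06-20", "2018-06-21"]
    ),
    "score": [350, 410, 500, 640, 720],
})

# Group by calendar month and compute post count and mean score.
by_month = posts.groupby(posts["created_at"].dt.to_period("M"))["score"].agg(
    count="count", avg_score="mean"
)
print(by_month)
```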


and it also looks like most topics are constant over the years - https://drive.google.com/file/d/1ilYn9cnEZwiH1FioUtU9ummhmvn...



Sorry for the OT post; I wanted to cheer for Ploomber - I love it for writing pipelines. I didn't know you're EU-based (guessing by your username?).


I don't know how HN score metrics work, but after a short review of the data file [1], I've noticed that a lot of the posts have the form of simple questions, and as such seem to be naturally biased when it comes to user engagement. Have you considered adding additional metrics to remove that bias and re-analyzing?

[1] https://storage.googleapis.com/cohere-assets/blog/text-clust...


What do you mean by naturally biased? That people seem to favor them?


People seem to be more engaged in discussions arising from questions rather than statements, no?


I think that's part of the expectations out of "ask HN". I don't know that the same effect happens outside of Ask HN.


I love this shit, dude. Not for any great insight. I just enjoy this sort of thing. Good stuff.


As people upvote other things than your list of relevant links, it becomes difficult to find the relevant links. Although I guess people can find it via your name.

On edit: so it seems some are upvoting the links to keep them at the top, in opposition to those upvoting discussion points.


Love your blog posts and visualizations Jay, thanks for sharing!


The scissor statement but it’s Go vs Rust.



