Show HN: Analyzing top HN posts with language models
117 points by jayalammar on June 10, 2022 | 43 comments
Hi HN,

I spent a few weeks looking at the top HN posts of all time. This included exploration, clustering, creating visualizations, and zooming in on what (to me personally) seems like some of the best discussions on here.

Three things in this post:

1- The interesting groups of HN posts

2- The interactive visualizations that you can explore in your browser

3- The data from this exploration -- this includes a CSV of the titles as well as the text embeddings of 3,000 Ask HN posts.

Blog post about this whole process here: [1]

============

1- The interesting groups of HN posts

From the exploration, Ask HN proved the most interesting. These are the top four groups of topics I found insightful. Each group contains about 400 posts.

- Life experiences and advice threads [2]

- Technical and personal development [3]

- Software career insights, advice, and discussions [4]

- General content recommendations (blogs/podcasts) [5]

============

2- The interactive visualizations that you can explore in your browser

- Top 10,000 Hacker News articles of all time [6]

- Top 3,000 posts in Ask HN [7]

============

3- The data from this exploration

CSV file of top 3K Ask HN posts: [8]

The sentence embeddings of the titles of those posts: [9]

This is a colab notebook containing the code examples (including loading these two data files): [10]
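As a rough sketch of what loading the two files looks like (the Colab notebook above is the authoritative version), the idea is a CSV of titles paired row-for-row with an embedding matrix. The column names and a "title"/"points" layout here are assumptions, and synthetic stand-ins replace the real files so the snippet is self-contained:

```python
# Minimal sketch: a titles CSV plus a row-aligned embedding matrix.
# The real data lives at the Cohere links above; synthetic stand-ins here.
import io

import numpy as np
import pandas as pd

# Stand-in for pd.read_csv("<the linked CSV>") -- column names are assumptions.
csv_text = "title,points\nAsk HN: How do you learn?,1200\nAsk HN: Best podcasts?,900\n"
titles = pd.read_csv(io.StringIO(csv_text))

# Stand-in for np.load("<the linked embeddings file>"); the post's embeddings
# are 1024-dimensional (see the UMAP discussion below).
embeddings = np.random.default_rng(0).normal(size=(len(titles), 1024))

# One embedding vector per title, aligned by row index.
assert len(titles) == embeddings.shape[0]
print(titles.shape, embeddings.shape)
```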

============

If you've ever wanted to get into language models, this is a good place to start. Happy to answer any questions.




The conflict of interest here concerns me. I don't object to content marketing, but I'd rather a) you were clear from the start that you work for this company and are promoting its product, and b) you mentioned that this "revolves around [...] using Cohere’s Embed endpoint", so that people can judge how much they want to "get into language models" with pay-per-character pricing, as opposed to something more open.


Do you advocate for the disclaimer only because 1) the sample uses their product or 2) just because they sell a product correlated to the topic?

I see a lot of articles that fall into #2 being published here without a disclaimer. And I think a disclaimer isn't necessary for #2. Even for #1 I wouldn't bother, but I understand the expectation.

Many advocate a lot against ads, targeting, etc. If we also advocate against promotional content, what would companies do to get attention and traffic?


Most content marketing goes to a post on the company's own website, which is a sort of inherent disclosure.

I don't think asking people to disclose that they work for the company whose product they're promoting is "advocating against promotional content".


That's true, it's not advocating against it; that was a bad way of expressing it.

But I do think it may undercut the power of content marketing by introducing an unconscious and unjustified negative bias.

If a stranger shares it on HN, should I judge it in a different way? What is the true value of highlighting that the author works for the company?


Do you have a better way of distinguishing between (1) somebody sharing a product because they genuinely think it is good and (2) somebody sharing a product because they have a profit motive, despite the product being sh*t?

Proper disclosure is a means to expose bias and external motive. It's not perfect but it's arguably the best method we have right now. There's a reason it's written into law [1]

[1]: https://www.ftc.gov/business-guidance/resources/disclosures-...


I think the link you shared is about a potentially separate issue.

First, it's advertising in the form of paid sponsorship. Second, people have reasons to trust what a social influencer says.

I'm doing neither if I have a business or work in a business and I share here an article that I wrote that happens to promote this business.

I'm not paying a stranger to talk about the business, and you don't have any reason to trust what I say without the usual judgement you'd employ towards a stranger.


There is still some trust I have with a stranger that posts on hackernews. Perhaps misplaced, but it's there nonetheless. I trust that if they took the time to post about a product, that they genuinely believe in it. If I later learn that they are just a marketer for that product, that trust will get diminished.

To give another example, why do we trust product reviews by strangers on Amazon? They could all be just marketers in disguise after all. But Amazon actually puts effort into removing fake reviews. Why? Because honesty matters. And when you have a bias or profit motive, it's important to disclose that part too, so people can judge your review appropriately. Paid reviewers exist, but they are usually disclosed.


But Amazon is full of fake reviews, and fake product listings, and shit... because it makes Amazon money.


And the more fake reviews, the less I trust them. Likewise, the more people here advertising their products without proper disclosure, the less I'll trust the honesty of the posts. So IMO it's in the interest of the community to encourage proper disclosure, so that a certain level of trust is maintained.


I don't understand why you think asking that conflicts of interest be clearly disclosed is me wanting to "advocate against promotional content". Do you believe content marketing only works if it's quietly manipulative? That seems like a pretty grim take.


True, a bad way of expressing it.

Correcting myself: what I mean is that adding a disclaimer may make people look at the content with suspicion, or even walk away without giving it a fair chance.

Whether I received a note from a stranger or from the company employee, it shouldn't change how I judge and evaluate it.

Yet, I think the latter introduces bias and undercuts what could be an honest marketing effort.

This is all speculative. No data-driven backing.


If telling the truth about the origin of content created with intent of manipulating them makes readers look at it with suspicion, why is that a bad thing?

I am just gobsmacked that you think an "honest marketing effort" requires being less than honest about the fact that it's a marketing effort.


Thanks. I just added a disclosure to the comment (can't edit the parent anymore). The full embeddings are freely provided here without the need to use the service.


Thanks! I appreciate it.


"Show HN"s regularly have commercial models behind them. Not sure what your stink is...this is normal.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


"Normal" doesn't mean good, so that line of argument doesn't work for me.

But the specific "stink" I'm objecting to here is, as I said, a conflict of interest. Specifically, saying "If you've ever wanted to get into language models, this is a good place to start" purports to be neutral and helpful, when in fact this person is promoting a product. Maybe using a (to my eyes expensive) commercial service truly is the best place to start learning that. Maybe this truly is the best service to learn it with. But we can't expect a fair answer to those questions from a person who works at the company and whose apparent job is promoting the product that pays their salary.



Disclosure: These were made with Cohere's embeddings; Cohere is the company I work for. The process should work on text embeddings from other sources.


There's too much fixation with "top" in our industry. Top voted tends to mostly be a function of early posting. Later posts don't get votes because they simply were not seen. There seems to be a misreading on a mass scale of what "top" really indicates though; people think it means "quality" when it does not. Study after study, website after website, policy after policy, our online world is built on this fundamental misunderstanding of what is really going on. How do you avoid piling on to this misunderstanding?


Is there a name for this phenomenon so I can google further? Intuitively makes sense, because I’ve seen this before.


Preferential attachment, power-law distributions, and the 80-20 rule are all the same phenomenon.

https://en.m.wikipedia.org/wiki/Preferential_attachment
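The rich-get-richer dynamic behind preferential attachment can be sketched with a toy urn-style simulation (stdlib only; post counts and vote totals are purely illustrative):

```python
# Toy preferential-attachment simulation: each new vote goes to a post with
# probability proportional to the votes it already has, so early leads snowball.
import random

random.seed(42)
votes = [1] * 100                      # 100 posts, one seed vote each
for _ in range(10_000):
    # pick a post weighted by its current vote count (preferential attachment)
    post = random.choices(range(len(votes)), weights=votes)[0]
    votes[post] += 1

votes.sort(reverse=True)
top_20_share = sum(votes[:20]) / sum(votes)
print(f"top 20% of posts hold {top_20_share:.0%} of the votes")
```

Even with identical "quality" (every post starts with one vote), the vote distribution ends up heavily skewed toward a few posts.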


"Already covered"


Interesting, but it doesn't seem like the dimensionality reduction produces a good separation of topics. The UMAP projection looks pretty dense. Did you consider pruning or using something other than embeddings?


So it really depends on what you use for clustering. In this case, I'm clustering on the original embeddings, so the cluster assignments don't necessarily line up with the 2-D UMAP layout. I've also seen:

1- Clustering by UMAP. Here the plot would show clean separation of topics. But the clustering algorithm would be working on highly compressed data (from the 1024 dimensions of the embedding down to the 2 of UMAP).

2- BERTopic's approach of doing UMAP down to 5 dimensions, using this dimensionality for clustering, then UMAP again from 5 to 2. Which is an interesting approach.

I've heard of people getting good results with all three. It's kinda hard to objectively compare, but my leaning was to give the clustering algorithm the representation containing the most information about the text.
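That last option (cluster in the full embedding space, project only for plotting) can be sketched roughly like this. PCA stands in for UMAP so the example only needs scikit-learn (the post itself uses UMAP via the umap-learn package), and random vectors stand in for the real 1024-d title embeddings:

```python
# Sketch: k-means on the full high-dimensional embeddings, then reduce to
# 2-D only for visualization. PCA is a stand-in for UMAP here.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 1024))  # stand-in for the real embeddings

# 1. cluster in the original 1024-d space (keeps all the information)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

# 2. project to 2-D purely for plotting (UMAP in the original post)
coords = PCA(n_components=2).fit_transform(embeddings)

print(labels.shape, coords.shape)  # (300,) and (300, 2)
```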


Right, BERTopic's double clustering is interesting. I've also seen people combine that with Louvain instead of k-means.

My intuition was: UMAP itself tries to optimize for 2-D separation in the projection, so we should expect at least some correspondence between the k-means results and the layout in the UMAP plot (except in some pathological edge cases, perhaps).

Nevertheless, nice example and blog post!


TIL Louvain clustering! I've seen it used for graphs. Can it also be used for vectors/points?

Thank you!


You're welcome!

You can actually create a graph by using pairwise similarities (e.g. between each point and its k nearest neighbors) as edge weights. Then you do graph clustering on it. (Using any algorithm, but Louvain is one of the saner ones; clique percolation, Girvan-Newman, etc. all have known problems.)
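A rough sketch of that recipe, reading it as a k-nearest-neighbor similarity graph: connect each point to its k nearest neighbors, weight edges by (shifted) cosine similarity, and run Louvain on the result. Uses networkx's louvain_communities (available since networkx 2.8); random vectors stand in for real embeddings:

```python
# Build a kNN similarity graph from vectors, then cluster it with Louvain.
import numpy as np
from networkx import Graph
from networkx.algorithms.community import louvain_communities

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 32))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit vectors: dot = cosine
sims = X @ X.T

k = 5
G = Graph()
for i in range(len(X)):
    neighbors = np.argsort(sims[i])[::-1][1 : k + 1]  # skip self at rank 0
    for j in neighbors:
        # shift similarity by +1 so edge weights stay positive
        G.add_edge(i, int(j), weight=float(1.0 + sims[i, j]))

communities = louvain_communities(G, weight="weight", seed=0)
print(len(communities), "communities")
```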


Try t-SNE. I used to scoff at cluster plots until I saw those but with t-SNE… wow, those clusters are actually separated!


Are you sure t-SNE and UMAP actually perform very differently? Last I looked, they were somewhat comparable.

[edit]: Seems they are similar for some purposes: https://blog.bioturing.com/2022/01/14/umap-vs-t-sne-single-c...

Also interesting: RAPIDS has a CUDA-accelerated version of UMAP that is very fast (HDBSCAN as well, BTW).


All these dimension reduction methods are extremely similar. The math essentially just preserves nearest neighbors, with a setting for how 'tight' you want the clusters to be.

Check out this image [1] and accompanying paper [2] for further reference

[1] https://www.semanticscholar.org/paper/A-Unifying-Perspective...

[2] https://arxiv.org/abs/2007.08902
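The "tightness" knob mentioned above is perplexity in t-SNE (n_neighbors in UMAP): lower values preserve only very local neighborhoods and tend to give tighter, more fragmented clusters. A small sketch with scikit-learn's t-SNE on synthetic blobs standing in for real embeddings:

```python
# Same data, two perplexity settings: the layouts differ, the shapes don't.
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, y = make_blobs(n_samples=150, n_features=32, centers=3, random_state=0)

for perplexity in (5, 30):
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=0).fit_transform(X)
    print(perplexity, coords.shape)  # both runs produce (150, 2)
```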


Nice idea and analysis! I reproduced it as well with https://graphext.com and got similar clusters https://drive.google.com/file/d/1-kXsKezu2_S07rQn-0bjbHuUXHE...


BTW, there is an implicit recency bias in the dataset: since 2017, top-3K posts have become more frequent, and the average score gets larger year after year as the HN community grows:

- Number of top 3K per month of publishing - https://drive.google.com/file/d/1beAPP9ijruMUs5DN5wOVsBArvxP...

- Avg score of top 3K per month of publishing - https://drive.google.com/file/d/10nSIgH1a6DN6XrDU2DyMJTCgsIg...
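The two plots above boil down to a per-month groupby over the posts' publication dates. A sketch with pandas, where the column names ("created_at", "score") and the tiny synthetic data are assumptions, not the real file's schema:

```python
# Count top-ranked posts and average their scores per month of publishing.
import pandas as pd

posts = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2016-03-01", "2016-03-15", "2017-06-02", "2018-06-20", "2018-06-21"]
    ),
    "score": [350, 410, 500, 640, 720],
})

# Group by calendar month and compute post count and mean score.
by_month = posts.groupby(posts["created_at"].dt.to_period("M"))["score"].agg(
    count="count", avg_score="mean"
)
print(by_month)
```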


and it also looks like most topics are constant over the years - https://drive.google.com/file/d/1ilYn9cnEZwiH1FioUtU9ummhmvn...



Sorry for the OT post; I wanted to cheer for Ploomber - I love it for writing pipelines. I didn't know you're EU-based (guessing by your username?).


I don't know how HN score metrics work, but after a short review of the data file [1], I've noticed that a lot of the posts have the form of simple questions, and as such seem to be naturally biased when it comes to user engagement. Have you considered adding additional metrics to remove that bias and re-analyzing?

[1] https://storage.googleapis.com/cohere-assets/blog/text-clust...


What do you mean by naturally biased? That people seem to favor them?


People seem to be more engaged in discussions arising from questions rather than statements, no?


I think that's part of the expectations out of "ask HN". I don't know that the same effect happens outside of Ask HN.


I love this shit, dude. Not for any great insight. I just enjoy this sort of thing. Good stuff.


As people upvote other things than your list of relevant links, it becomes difficult to find the relevant links. Although I guess people can find it via your name.

On edit: so it seems some are upvoting the links to keep them at the top, in opposition to those upvoting discussion points.


Love your blog posts and visualizations Jay, thanks for sharing!


The scissor statement but it’s Go vs Rust.



