Websites That Feed Hacker News: Top Sources of Submissions by Median Score (github.com/antontarasenko)
134 points by anton_tarasenko on April 14, 2016 | 48 comments



This is a pretty good example of how certain metrics aren't always relevant to reality or, at least, don't match the headline. The word "feed" implies that HN depends on the contributions from/links to these sites, but most users of HN would argue that domains such as github.com, github.io, nytimes.com, etc. are far more prevalent and important to HN than virtually any of the domains listed here. HN depends on daily traffic...It's not that the sites with high medians aren't good, but they don't "feed" HN...Median score in this context is a trivial metric. Number of top stories, daily, by domain would be far more relevant in showing what "feeds the beast", as they say in the media business.


> domains such as github.com, github.io, nytimes.com, etc. are far more prevalent and important to HN

NYTimes.com is certainly more prevalent but, just like Grauniad lately, it seems to be diluting quality on HN rather than being important. Same goes for Medium mentioned above.

Whereas sites like patio11's or cperciva's blogs, YC startups (bu.mp), and tutorials are what make HN unique and interesting.


> it seems to be diluting quality on HN rather than being important

I think this underscores the difficulty in quantifying the nature of "quality", especially for a broad audience. I generally check the NYT homepage every day, so seeing its URLs on HN isn't particularly helpful to me (ignoring the value of the HN discussions)...however, there is so much interesting information on a daily basis, period, that I bet if the HN front page consisted solely of the most-upvoted of high-traffic mainstream sites, e.g. github, nytimes, medium...it'd still be interesting to me because there'd be a lot that I would've missed otherwise.

That said, it'd be cool to have an option/Chrome plugin to filter the frontpage links to domains with relatively rare submissions, just to be able to quickly see the unique upvoted submissions for the day.



[insert quip about techcrunch.com]

A little off-topic, but this reminds me of how non-straightforward it is to categorize what the true "domain" of a given URL is. Blogspot.com has the most submissions by domain, but the API that runs the ?site query on HN returns just 5 links:

https://news.ycombinator.com/from?site=blogspot.com

I suspect that's because HN differentiates between somerandomdude.blogspot.com and googleblog.blogspot.com...but that in itself is an editorial/arbitrary decision...Why is the subdomain more relevant for differentiation there than the path is for github, where github.com/blog is grouped in with github.com/somedudesrepo?

And FWIW, it seems subdomain faceting is done manually...education.github.com links are shown on HN's page as just github.com...whereas googleblog.blogspot.com's domain is fully listed.


There is a well-defined solution to this problem: The Public Suffix List.

https://publicsuffix.org/

blogspot.com is in it. github.com is not, but github.io (where Github Pages are hosted) is. I would guess this is what HN uses.

That said, it of course can't help with e.g. categorizing github repo links by user, since those are by path rather than subdomain. Ah well.
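
For anyone curious how a Public Suffix List lookup works, here is a minimal sketch. It assumes a tiny hand-picked subset of suffixes and ignores the real list's wildcard and exception rules, so treat it as an illustration rather than what HN actually runs. The function name `registrable_domain` is made up for this example.

```python
# Tiny hand-picked subset of the Public Suffix List, for illustration only.
# The real list is far longer and also has wildcard and exception rules.
SUFFIXES = {"com", "io", "blogspot.com", "github.io"}

def registrable_domain(host, suffixes=SUFFIXES):
    """Return the longest matching public suffix plus one label to its left."""
    labels = host.lower().strip(".").split(".")
    # Start from the leftmost label so the longest candidate suffix is tried first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched

for host in ("googleblog.blogspot.com", "education.github.com", "someuser.github.io"):
    print(host, "->", registrable_domain(host))
```

With this subset, blogspot and github.io hosts keep their subdomain while education.github.com collapses to github.com, which matches the behaviour described above.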


I feel like the best way to deal with this would be to group by subdomain if a domain has more than say 5 or 10 different subdomains submitted, but group by domain if there are fewer. That way subdomains of "hosted" domains like *.blogspot.com get their own category, but domains controlled by a single entity are treated like a single category.
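
A rough sketch of that heuristic, assuming the base domain is simply the last two labels of the host (which breaks on things like .co.uk) and using a threshold of 5. The names `base_domain` and `grouping_keys` are made up for this example.

```python
from collections import defaultdict

def base_domain(host):
    # Naive assumption: the last two labels form the base domain.
    return ".".join(host.split(".")[-2:])

def grouping_keys(hosts, threshold=5):
    """Map each host to the key it should be grouped under."""
    # Collect the distinct hosts seen under each base domain.
    seen = defaultdict(set)
    for host in hosts:
        seen[base_domain(host)].add(host)
    keys = {}
    for host in hosts:
        base = base_domain(host)
        # Many distinct subdomains suggests a hosting platform: keep the full host.
        keys[host] = host if len(seen[base]) > threshold else base
    return keys

hosts = [f"user{i}.blogspot.com" for i in range(6)] + ["education.github.com", "github.com"]
for host, key in grouping_keys(hosts).items():
    print(host, "->", key)
```

The threshold is a judgment call; too low and a company with a handful of subdomains gets split up, too high and small hosting platforms get lumped together.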


The link is updated to reflect the following

After SeanDav's question and minimaxir's comment, I summed up reposts' scores before computing the mean and the median:

HN news sources by mean score: https://docs.google.com/spreadsheets/d/1tTDDG2xg7OVKdUy4WCZ_...

HN news sources by median score: https://docs.google.com/spreadsheets/d/1P20sKg-fI6msZVZtJFe0...

HN news sources by number of submissions: https://docs.google.com/spreadsheets/d/1mmfbNWaX0Nr1P65VmwZp...

SQL code: https://github.com/antontarasenko/smq/blob/master/hackernews...

How-to: https://github.com/antontarasenko/smq
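
For readers who don't want to open the SQL, the repost step could look roughly like this in Python. This is a sketch under my own assumptions, not a translation of the linked query: submissions are `(url, score)` pairs, reposts are taken to be submissions with an identical URL, and the domain is whatever `urlparse` returns as the host.

```python
from collections import defaultdict
from statistics import mean, median
from urllib.parse import urlparse

def source_stats(submissions):
    """submissions: iterable of (url, score) pairs; reposts share a URL."""
    per_url = defaultdict(int)
    for url, score in submissions:
        per_url[url] += score  # collapse reposts into one total per URL
    per_domain = defaultdict(list)
    for url, total in per_url.items():
        per_domain[urlparse(url).netloc].append(total)
    # One (mean, median) pair per domain, computed over the summed totals.
    return {d: (mean(s), median(s)) for d, s in per_domain.items()}

# Made-up data: the same article submitted three times, plus one other link.
sample = [
    ("http://example.com/a", 1),
    ("http://example.com/a", 1),
    ("http://example.com/a", 100),
    ("http://example.com/b", 4),
]
print(source_stats(sample))
```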


You should do number of submissions where min_score > X (maybe 5 or so). This will help filter out the spam submissions that no one ever sees.



I really find it discouraging that Sam Altman is at the top of that list. Most of his articles fall into two categories: promoting things that will make him money directly[1], or myopic musings/self-serving advice to people that will make him money indirectly[2].

Is the HN algorithm rigged in favour of things he writes, or does this community really get a lot out of the things he says?

[1] http://blog.samaltman.com/asana

[2] http://blog.samaltman.com/the-tech-bust-of-2015 made me laugh, for example


Checking out the people posting from his domain it's not just him spoonfeeding his own content to users, but a range of users actually of their own free will sharing posts they find interesting by him.

Proof: https://news.ycombinator.com/from?site=samaltman.com


Sam owns YC, YC owns HN, what does it matter? The whole purpose of HN is to make Sam (and the other partners, and investors, and YC startups) money. Mindshare is incredibly valuable. It's advertising that doesn't totally suck.


Hacker News is a bit of a misnomer. It doesn't, nor has it ever, served hackers. This is a site for the startup kids, and you either love it or hate it, but you gotta accept it for what it is.


Actually the better part of HN's audience isn't involved in startups and a sizeable portion (dismayingly sizeable in my view) is cynical about them.

"Startup kids" is too dismissive. Some of the very best comments about startups come from grizzled veterans. Will ChuckMcM or Animats mind if I call them grizzled? Let's just pause to appreciate what incredible value they and others add to this community from the wealth of their experience.

Then again, depending on your definition of "kid" there are "kids" on HN whose experience with startups is already impressive. Experience should perhaps be measured in iterations, not years.

HN has many subgroups, including plenty of hackers. Plenty of purely technical stories make the front page. And the startup and hacker groups overlap.

We get complaints about the balance whichever way HN trends.


Lots of hackers are well known for their vast knowledge of the tech scene, including vague startups that nobody has heard of. I know I make 'hackers' sound like hipsters, but labels are usually poor representations of a generic set of traits; people just take today's newer labels and magnify the worst of the worst. Coincidentally I never liked labels, but hacker and geek are things that gave me peace of mind after scraping out of high school (mostly geek).


> It doesn't, nor has it ever, served hackers.

Wrong. When I first lurked here a high percentage of posts related to startup concerns. The number has dropped dramatically over time. My non-hacker son-in-law, who only knows finance, was a Hacker News reader a few years ago, hoping to learn tips about starting something up. He gave up due to the dwindling number of such posts.


Hacker News has its biases, but that does not mean it cannot serve hackers.


The point is that a lot of us lose sight of the bias inherent here. Sometimes I get lost in how great HN is, especially compared to similar online communities.


Surprised not to see http://nautil.us/ on here. I see their articles so often on here that I have to avoid clicking them so as not to spoil my print version.

FWIW, if anyone from that site/mag is a frequent HN reader: HN is the reason I subscribed, and gifted subscriptions to several of my family for xmas this year.


They didn't do very well on HN until about a year and a half ago, as I noted in this essay: https://www.jboy.space/log/ssrc-digital-media-reflection.htm...


Besides cutting off at 10 submissions, you should probably also throw away anything that got say 2 points or less. Something like medium is brought way down by all the submissions that got 1 point, which means they probably never got seen. HN lets you resubmit low scoring items exactly for this reason.
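
To show how much that filter can move the number, here is a small sketch with made-up scores. The helper name `median_above` and the cutoff of 3 (i.e. dropping anything with 2 points or less) are just for illustration.

```python
from statistics import median

def median_above(scores, min_score=3):
    """Median of the scores at or above min_score, or None if none qualify."""
    seen = [s for s in scores if s >= min_score]
    return median(seen) if seen else None

# Made-up scores for one domain: mostly unseen submissions plus two hits.
scores = [1, 1, 1, 1, 2, 40, 120]
print("median of everything:", median(scores))
print("median after dropping <= 2 points:", median_above(scores))
```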


I'd be more interested in seeing the distribution across submissions that actually made the front page.

There's a daily deluge of articles from ars, techcrunch, nytimes, etc., so the (tons of) articles that do get to the top get penalized by the ones that don't.

I don't think there's a flag for "hit front page" so might have to estimate that with a min point filter instead.


A brief motivation for the parameters:

1. Sorting by the median. The mean is not very informative for the quality of the source. Most sources provide low-scored content with eventual hits that drive the mean up. The median fixes this problem.

2. Cutting off at 10 submissions. An arbitrary minimum to exclude pure luck from the results.

In the end, this ranking excludes websites like github.com and youtube.com, but it features some less known sources.
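
The two parameters above can be sketched in a few lines of Python. This is only an illustration with made-up scores, not the SQL behind the ranking; the name `rank_by_median` is hypothetical, and I'm assuming "cutting off at 10" means keeping sources with at least 10 submissions.

```python
from statistics import mean, median

def rank_by_median(scores_by_domain, min_submissions=10):
    """Rank domains by median score, skipping those below the cutoff."""
    eligible = {
        domain: scores
        for domain, scores in scores_by_domain.items()
        if len(scores) >= min_submissions
    }
    return sorted(eligible, key=lambda d: median(eligible[d]), reverse=True)

# Made-up scores, not taken from the real dataset.
sample = {
    "steady.example": [20] * 10,        # consistently decent
    "spiky.example": [1] * 9 + [1000],  # one hit, nine duds
    "lucky.example": [500],             # a single submission, excluded
}

for domain in rank_by_median(sample):
    print(domain, "mean:", mean(sample[domain]), "median:", median(sample[domain]))
```

The spiky source has the much higher mean but ranks below the steady one by median, which is the effect described in point 1.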


What problem does the median fix? Many of the top sites in this list are fairly niche; some don't even really exist any more (e.g., adgrok.com being a business that sold to Twitter in 2011)...Undoubtedly, median is a better metric than mean when the desire is to remove outliers...but in the way that HN works, I'm not sure that need is relevant here. github.com and nytimes.com are absent from this list because a lot of their links get submitted...but I bet a lot more Github users can recall 5 great submissions in the past week from either domain than they can from chris-granger.com, even among fans of Light Table and Eve.

That said, I would be interested in the mean, just to see how different the two lists might be.



The mean will be worse than the median due to the influence of 1-point submissions.


It would be interesting to compute the h-index for all HN submissions, with score instead of citations, then sort them from highest to lowest.
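
For reference, a minimal sketch of the h-index with scores in place of citations: a source has index h if h of its submissions scored at least h points each. The function name and the sample scores are made up.

```python
def h_index(scores):
    """Largest h such that at least h submissions have a score of h or more."""
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:
            h = rank  # the top `rank` submissions all score at least `rank`
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))
```

Unlike the median, this rewards sources that have both volume and hits, so a flood of 1-point submissions doesn't drag a source down.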


Not sure how accurate this is; alternatively, it might need different assumptions. What about NYTimes, WSJ, GitHub, BBC, ArsTechnica, Medium, etc.?


These websites have a low median score. That is, they have many submissions, many of them not relevant, so the median is low.


"Not relevant" is not the same as "not upvoted." There are a number of reasons why a submission does not receive many upvotes which are unrelated to the quality of the content itself, which is why HN has repost rules.

The 10 story minimum is to ensure a reasonable threshold for error, so that a single submission with 1000+ points (e.g. a Show HN) doesn't skew the results.


Do you mean that duplicates from popular sources (NY Times, WSJ, etc) spoil stats for these sources?


Not duplicates, but more noise than signal.


Actually, if you look at others' analyses[0], some of those sites have a high mean and/or median. On median score alone samaltman.com should be #1. The highest "mainstream" news source would be newyorker.com; it does have a low median, but the average is ~20 times greater.

[0] https://docs.google.com/spreadsheets/d/1-TCo1mxiTkO4ZiXg5acU...


How is paulgraham.com not on this list?


Because this is looking at median score. Lots of people submit PG links as soon as they show up, but only one or two of those submissions will make it to the front page. If more than half of PG links have a score of 0–5, then the median will be in that range as well.


So HN itself does the merging and the raw dataset still includes the numerous duplicate submissions then? If this is the case it's not just sources with a lot of content like medium.com, github.com, nytimes.com being dragged down, it's any popular source.


Paulgraham.com is in the second hundred.


This morning an article I visited from the front page had only been around 21 seconds and already had 60 comments.

http://imgur.com/1oyIv2d


I'm surprised I don't see medium in here.

Even more, I'm starting to see more posts from Medium nowadays, and their quality has declined relative to 2015.


Medium is more noise than signal. There is an absurd amount of Medium submissions on HN (in fact, my curiosity about why everyone on HN liked Medium all of a sudden is the primary reason I started doing data visualization on public data.)

If Reddit and YouTube submissions can have ranking penalties due to highly variable quality, so should Medium.


In the future, post this as CSV and GitHub will turn it into an even-nicer tabular format. Not to mention retaining the machine readability.


Ha! Look at Chris Granger go! Don't get me wrong, his work is awesome, but it's pretty funny to see an individual in the top 10.


Interesting but weird. Some of these sites don't seem to exist (anymore?), like muckandbrass.com


This website looks like spam. The Wayback Machine doesn't have a good history of it: http://web.archive.org/web/20030407151435/http://www.muckand...


It was a blog about Clojure 6 years ago; 2003 is too far back.

https://web.archive.org/web/20100415161333/http://muckandbra...

In 2011, it was redirected to this blog, which is still live: https://cemerick.com/


I'm gonna put money on it that the big hitters in the list probably game Hacker News a bit by asking their friends to vote them up.


The voting ring detector _should_ take care of that.



