Hacker News
Deriving the Reddit Formula (evanmiller.org)
169 points by jmduke on July 14, 2015 | 19 comments



Note: The author of this piece wrote the original analysis (http://www.evanmiller.org/how-not-to-sort-by-average-rating....) that directly led to my advocating for reddit's "best" comment sort, including to xkcd's Randall Munroe, who managed to finally convince the rest of Team Reddit to make the change (http://www.redditblog.com/2009/10/reddits-new-comment-sortin...).

So when the author says, "I realize that proposing any change to how Reddit works is one of the Internet's most dangerous games," he should be informed that it's actually a game he's won before.


Wow, I didn't know that. I was going to say that a long time ago I randomly read Evan's article about 'how not to sort by average rating'. Good work.


Do the conclusions change if we assume that the goal is not to optimize for value to the user, but for value to the company? Specifically, it seems that using the vote difference rather than the vote ratio would help controversial stories rank high. Controversial stories in turn are great at driving lots of discussion, which means lots of visits and revisits to the comment section.

(Compare to the apparent HN policy of discouraging controversy, both by the large effect of flagging and through the flamewar detector).
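
To make the difference-versus-ratio point concrete, here is a minimal sketch. The vote counts are made up for illustration and the two helper functions are hypothetical, not Reddit's code. It only shows that a heavily contested story can outrank a small, well-liked one when sorting by vote difference, while the order flips when sorting by vote ratio.

```python
def vote_difference(ups, downs):
    """Net votes: rewards sheer volume, even when opinion is split."""
    return ups - downs

def vote_ratio(ups, downs):
    """Fraction of votes that are upvotes: rewards agreement."""
    total = ups + downs
    return ups / total if total else 0.0

# Made-up example data: (ups, downs)
controversial = (1000, 900)  # lots of votes, nearly even split
well_liked = (60, 5)         # few votes, almost all positive

by_difference = sorted(
    {"controversial": controversial, "well_liked": well_liked}.items(),
    key=lambda kv: vote_difference(*kv[1]),
    reverse=True,
)
by_ratio = sorted(
    {"controversial": controversial, "well_liked": well_liked}.items(),
    key=lambda kv: vote_ratio(*kv[1]),
    reverse=True,
)

print([name for name, _ in by_difference])
print([name for name, _ in by_ratio])
```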


Possibly short term, but if you build a site around the kinds of people who like flame wars, your advertisers may notice they're getting very little traction from your site. Page views go up, but uniques go down.


Have you met Reddit? They love flamewars.


Yishan posted a couple hours ago that Reddit's business model is built upon flamewars. Basically get people heated up and then they gild the comments they agree with.

https://np.reddit.com/r/announcements/comments/3dautm/conten...

...come to think of it, this explains a lot of the Reddit drama. The Reddit Gold was flying left and right during the black-out (despite pleas not to give any money to the site), and then every time an admin, CEO, former CEO, or board member pours gas on the fire, their revenue goes up. It's brilliant! They've figured out how to make money off of hurt feelings and angry people, and now have an incentive to cause as much drama as possible.


Ha, this is great, in a not-so-great but kinda-great way!


I thought they were having advertiser issues, but maybe I'm out of the loop here.


reddit also, as far as I know, fares rather poorly on the $/view metric. (It's not quite that simple, sure.)


>This resolves one of the original mysteries — why the current time doesn't appear in Reddit's formula.

The current time did appear in Reddit's very early formula, but they figured it was computationally much faster to inflate the scores of new stories than to go through the database reducing the scores of loads of old stories.
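
A rough sketch of why the two approaches give the same ordering, assuming the decay is exponential (an assumption for illustration; the constant `TAU`, the function names, and the sample stories are all hypothetical). Decaying every score by the current time multiplies all of them by the same factor, so the ranking matches one where each story's score is inflated once, at submission, and never touched again.

```python
import math

TAU = 45000.0  # hypothetical decay constant, in seconds

def decayed_score(votes, posted_at, now):
    """Score that shrinks as the story ages; depends on the current time."""
    return votes * math.exp(-(now - posted_at) / TAU)

def inflated_score(votes, posted_at):
    """Score fixed at write time; newer stories simply start higher."""
    return votes * math.exp(posted_at / TAU)

# Made-up data: (name, votes, posted_at in seconds since some epoch)
STORIES = [("a", 120, 0.0), ("b", 40, 60000.0), ("c", 15, 90000.0)]

def rank_decayed(now):
    """Ranking that would require rescoring every story as time passes."""
    ordered = sorted(
        STORIES, key=lambda s: decayed_score(s[1], s[2], now), reverse=True
    )
    return [name for name, _, _ in ordered]

def rank_inflated():
    """Ranking from stored scores only; no reference to the current time."""
    ordered = sorted(
        STORIES, key=lambda s: inflated_score(s[1], s[2]), reverse=True
    )
    return [name for name, _, _ in ordered]

print(rank_decayed(100000.0), rank_decayed(500000.0), rank_inflated())
```

The decayed and inflated scores differ only by the factor exp(-now / TAU), which is the same for every story, so the sort order cannot change.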


Confirmed, but the time-dependent model didn't last for long. I seem to recall switching to the current model (with slightly different constants) during the switch to Python in late '05.

That said, one of the early "mistakes" we made was not noticing the obvious step of taking the log of the exponential form of the formula, which meant we had only a couple hundred days before we started to run afoul of the max float size in PostgreSQL. I also seem to recall coming up with that obvious trick with only a few days to spare.
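
For anyone following along, a minimal sketch of the log trick in Python rather than PostgreSQL. The constant `TAU` and the function names are assumptions for illustration, not the original code. The exponential form eventually exceeds what a double can hold, while its logarithm grows only linearly with time and sorts stories identically, because log is monotonic.

```python
import math

TAU = 45000.0   # hypothetical time constant, in seconds
DAY = 86400.0

def exp_form(votes, t):
    """Exponential form: grows without bound as t increases."""
    return votes * math.exp(t / TAU)

def log_form(votes, t):
    """Natural log of exp_form (votes must be > 0); linear in t."""
    return math.log(votes) + t / TAU

def overflows(votes, t):
    """True if the exponential form no longer fits in a float."""
    try:
        exp_form(votes, t)
    except OverflowError:
        return True
    return False

# Made-up data: (votes, seconds since the site's epoch)
early = [(120, 1 * DAY), (40, 2 * DAY), (15, 3 * DAY)]

same_order = (
    sorted(early, key=lambda s: exp_form(*s))
    == sorted(early, key=lambda s: log_form(*s))
)
print(same_order)
print(overflows(10, 1 * DAY), overflows(10, 400 * DAY))
print(log_form(10, 400 * DAY))
```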


I'm really glad you're here, because I was about to comment that you're the only person I know who understands this stuff.

(For those who don't know, keysersosa was the guy who wrote the current hot algorithm.)


Ah - I was thinking about how exponential growth would go over the max size and how to deal with it and also wondering why they were taking logs and not putting two and two together. It's surprising how tricky it can be to see what seems obvious with hindsight.


Cool. But why is an outside researcher writing about this? Where is the research from within Reddit? They have as unique a dataset as Facebook, Google, or Yahoo. They should do something cool with it and share it, like this guy.


Don't forget the group of users who can move stories around by adding or removing votes at will. I'm sure there's been a lot of "Put this on the front page please" favors done for friends.


I think this type of formula assumes a uniform list of links? Whereas I'd expect a discontinuity at the end of the front page. I wonder what it would look like if it took that into account; maybe the q term could incorporate the page number somehow.


The biggest problem is that reddit's voting system is still pretty easy to game.


What algo does HN use?




