How Reddit ranking algorithms work (amix.dk)
241 points by kopsai on March 22, 2012 | 29 comments



One of the problems with Reddit in my experience is the spam filter. What happens in the bigger subreddits (say, /r/pics or /r/funny) is that there seem to be people around who downvote stuff regardless of content, presumably to give their own posts a better chance. When your submissions consistently end up with -1 or -2 scores, the spam filter then decides that you are a spammer and you should be punished. Except in this case it's not for submitting spam, but because other people want their own submissions to have more visibility.

(As an aside, I obviously wouldn't be complaining about this if I had actually tried to spam reddit; my few submissions to date match the typical fare there.)

Once the spam filter 'hates' you (as they put it), subsequent submissions are 'ghosted' in the sense that you see them, but they don't appear in the /new section of the subreddit. Assuming you get wind of this (no downvotes, no comments, nothing), you get to message the moderators, who may or may not respond in good time and restore your submission. By that time your obvious reaction is to delete and re-submit (because by now the post has drifted down the submission sort order), which apparently makes the spam filter hate you even more.

The problem I have with this is that through no fault of my own I ended up submitting things that were flagged as spam and hidden from view without any indication to me. You can see your posts, but no one else can. This might be a clever solution to deal with systematic spammers, but it's really annoying to normal users like me.

Karma games are fun and internet drama is lame, of course; I don't lose sleep over this. But it's unfair to people who want to share something with the community. Still, you see reposts of lame memes hit the front page and wonder why the picture of your cute hedgehog, submitted in good faith, ends up in the spam void.

There's supposed to be a user flag that tells the filter you're an approved submitter for a given subreddit; good luck getting that, I guess. So ultimately this makes me not want to submit anything. Right now reddit might not care about that because it has such a large user base, but I would remind them that Slashdot and Digg also thought they were invincible once.


Yeah, I think users should have a limited amount of downvote power to spread around, and that each successive downvote diminishes the power of all their previous downvotes in determining spamminess.
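
Something like this toy sketch captures the idea (purely hypothetical -- the budget, names, and inputs are all made up, this is not anything reddit actually does):

    # Each user gets a fixed downvote "budget", so every extra downvote they
    # cast dilutes the weight of all their previous ones when measuring how
    # "spammy" a submitter looks.

    DOWNVOTE_BUDGET = 10.0  # hypothetical per-user budget

    def downvote_weight(total_downvotes_cast):
        """Weight of one downvote from a user who has cast this many in total."""
        if total_downvotes_cast == 0:
            return 0.0
        return DOWNVOTE_BUDGET / total_downvotes_cast

    def spamminess(downvoters):
        """Sum of weighted downvotes against a submitter.

        `downvoters` maps each voter to the total number of downvotes that
        voter has cast site-wide (hypothetical input)."""
        return sum(downvote_weight(n) for n in downvoters.values())

A serial downvoter who has cast 1000 downvotes contributes 0.01 per vote, while someone who has only ever cast 5 contributes 2.0 per vote.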

Ghosting really, really sucks for legitimate users. Craigslist does the same thing, and normal people are devastated when they realize they've been judged/victimized by a computer somewhere with no appeal.


I like the Stack Overflow system of downvoting:

  Downvotes remove 2 reputation from the post owner.
  Downvotes on answers remove 1 reputation from you, the voter.
  Downvotes on questions are free.
  You can vote 30 times per UTC day. You get an additional 10 votes on questions only.
On Reddit, people just tend to downvote when they don't agree with something. That's really it. The only times I've downvoted posts have been when something is a recent repost, and on comments I only downvote things that are hateful or don't contribute to the discussion.


One thing I wish were different about the way Reddit's algo works is the way it treats downvotes. Basically, if I am your average Joe redditor, there are two potential reasons for a downvote:

1) The post is something I personally disagree with.

2) The post was written by a moron (for various definitions of the word moron).

What I wish is that someone (smarter than myself) could come up with a way to reliably determine the motive behind a downvote. I know that's no simple task. But I actually want posts that are slightly more controversial to rank higher than posts which don't generate a lot of downvotes, simply because I'm more interested in reading something thought-provoking than checking out the latest batch of meme-gen.

Basically I want comment rankings to look like this:

a) Lots of downvotes, few upvotes == total asshole. Maybe racist, maybe a general troll, but I want this ranked at the rock bottom.

b) An equal number of upvotes vs. downvotes == flamebait. I'm not interested in the latest holy wars, so we call this the next lowest in rankings.

c) Lots of upvotes, few downvotes == uninteresting. Rank it lower.

d) The ratio upvotes/downvotes is greater than 1, but less than some finite number N. These are the stories I want to read.

I want a ranking system that works something like the above, albeit with more finesse (e.g. perhaps we swap the ranking position of b and c). I'm just not interested in the latest round of adorable kitten pictures, and on HN, too much emphasis on the sheer number of upvotes manifests as groupthink. Also, I want the downvotes that are received by slightly controversial stories/comments (again, slightly controversial as opposed to PC-vs-Mac style flamebait) to have a smaller effect on the poster's karma, so that people will feel free to express disagreement on sites like reddit and HN.
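
As a rough, hypothetical sketch of that ordering (the thresholds, like N and the "flamebait" band, are made up for illustration -- this is not anything reddit or HN does):

    N = 20   # beyond this up/down ratio, treat the post as bland consensus

    def bucket(ups, downs):
        """Coarse rank bucket for a post; higher buckets are shown higher."""
        ratio = ups / max(downs, 1)
        if ratio < 0.5:
            return 0   # (a) heavily downvoted: rock bottom
        if ratio < 1.5:
            return 1   # (b) roughly even: flamebait
        if ratio > N:
            return 2   # (c) nothing but upvotes: bland, rank it lower
        return 3       # (d) clearly positive but contested: what I want to read

    def rank(posts):
        """Sort (ups, downs, title) tuples: bucket first, net score second."""
        return sorted(posts, key=lambda p: (bucket(p[0], p[1]), p[0] - p[1]),
                      reverse=True)

Swapping the ranking position of (b) and (c) is just a matter of swapping the two middle return values.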


Seeking controversial postings makes sense.

But why do you downvote controversial stuff then? Why not upvote interesting postings and downvote boring ones?


I, personally, don't downvote as a way to express disagreement. The average redditor does. I wish it were reasonable to expect the entire reddit community to use downvotes only in ways I consider appropriate, but I don't think it is.


I think every voter should get their own list of sorted stories and comments. That list would be formed from the voter's own upvotes and downvotes. Readers would quickly stop seeing postings they are likely to downvote and see more postings they are likely to upvote. Pretty soon, people who express an interest in controversy would get controversial stories at the top of their list, and people who express an interest in groupthink would get lots of stories reconfirming their groupthink.
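
A bare-bones sketch of what that could look like (entirely hypothetical -- nothing like this exists in reddit): score each story for a given reader by how much the story's voters have historically agreed with that reader.

    def agreement(reader_votes, other_votes):
        """Votes are +1/-1 keyed by story id; mean agreement on shared stories."""
        shared = reader_votes.keys() & other_votes.keys()
        if not shared:
            return 0.0
        return sum(reader_votes[s] * other_votes[s] for s in shared) / len(shared)

    def personal_score(story_votes, reader_votes, all_votes):
        """Weight each vote on a story by how much that voter agrees with the reader.

        `story_votes` maps voter -> +1/-1 for this story; `all_votes` maps each
        voter to their full vote history."""
        return sum(vote * agreement(reader_votes, all_votes[voter])
                   for voter, vote in story_votes.items())

Sorting by personal_score would indeed push controversy-lovers toward controversy and groupthink-lovers toward more groupthink.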


Did not expect this post from 2010 to hit HN... Happy to see that my content is still useful :)

To add something new to the discussion: I think HN should use Reddit's comment sorting algorithm (the confidence sort). It produces way better results than the current algorithm since it ranks the best comments highest regardless of their submission time. It would probably produce an even better comment section here on HN.
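
For readers who haven't seen it, the confidence sort is essentially the lower bound of the Wilson score interval on a comment's upvote proportion. A short sketch of the idea (reddit's code uses z ~ 1.28; z = 1.96 would give a 95% interval):

    from math import sqrt

    def confidence(ups, downs, z=1.281551565545):
        """Lower bound of the Wilson score interval for the upvote proportion."""
        n = ups + downs
        if n == 0:
            return 0.0
        p = ups / n
        return ((p + z * z / (2 * n)
                 - z * sqrt((p * (1 - p) + z * z / (4 * n)) / n))
                / (1 + z * z / n))

Comments are then simply sorted by confidence(ups, downs) in descending order, so submission time never enters into it.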


Can't find a reference for this at the moment, but I think PG has since modified the HN ranking code a bit to put interesting stuff at the top. (This is since his last arc dump, so I don't know exactly what it is.)

Also, I think the situation on HN is quite different from reddit because vote totals aren't shown here anymore. It's not clear that you necessarily want to use the same ranking system as on reddit, where current vote counts affect how people vote in the future.


For reference, the mathematical notation of the hot algorithm is from my master's thesis[1].

[1]: http://www.duo.uio.no/sok/work.html?WORKID=81971&lang=en


I wrote a simple jsfiddle page that shows you the current stories in /r/all and their associated ratings as calculated by the hot algorithm:

Result: http://jsfiddle.net/6FVd3/embedded/result/
Source: http://jsfiddle.net/6FVd3/

One thing I noticed is that the stories aren't sorted by the algorithmic score in strict descending order. I recall someone mentioning that reddit mangles the ups/downs scores exposed via their API. Is this the reason?


Note that this article is from 2010. The most recent code for the Reddit ranking algorithms is here:

https://github.com/reddit/reddit/blob/master/r2/r2/lib/db/_s...

It looks like they've tweaked the confidence sort algorithm a bit.


    This has a big impact for stories that get a lot of upvotes and
    downvotes (e.g. controversial stories) as they will get a lower
    ranking than stories that just get upvotes. This could explain why
    kittens (and other non-controversial stories) rank so high :)
That actually seems backwards to me. A controversial story (that is, one that triggers a lot of votes both negative and positive) would do just fine so long as it continues to receive more upvotes than downvotes.

That's not to say that downvotes don't put a damper on things -- if a story gets 80% upvotes and 20% downvotes, it will get 75% of the score it would get without downvotes, which on a log scale is 87.5% as high a ranking.

But the point is that things are still roughly linear -- you only need to stay positive (> 50% upvotes) to accrue score, and you will do so linearly. Maybe I am underestimating how competitive scores are, and how important upvote rate is. Maybe a small linear damper on upvote rate has a large effect relative to other content? But it definitely seems like this system encourages interesting, controversial content over bland, well-received content.
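
For concreteness, here is the story-ranking ("hot") formula under discussion, as described in the article and reddit's open-sourced code (a sketch, not the exact production code):

    from datetime import datetime
    from math import log10

    def epoch_seconds(date):
        """Seconds since the Unix epoch for a naive UTC datetime."""
        return (date - datetime(1970, 1, 1)).total_seconds()

    def hot(ups, downs, date):
        """Log of the net score, plus ~1 point per 12.5 hours of submission time."""
        s = ups - downs
        order = log10(max(abs(s), 1))
        sign = 1 if s > 0 else -1 if s < 0 else 0
        seconds = epoch_seconds(date) - 1134028003
        return round(sign * order + seconds / 45000, 7)

With 800 ups and 200 downs, order is log10(600) ~ 2.78 versus log10(800) ~ 2.90 for the same story with no downvotes at all -- a small penalty compared with the ~1.9 points a story picks up just by being submitted a day later.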


Here's a SQL implementation: https://github.com/reddit/reddit/blob/master/sql/functions.s...

It's also interesting to see their "controversy" algo in there.


I've implemented this in one project, with an addition that allows entries to 'stay afloat', or resurface, if they suddenly get a lot of votes. So the behavior is more like what you would see in the 'Popular now' list on YouTube, without having to compute running averages.

The way I did it was to update the post time by a fraction of the difference between the old timestamp and the current time, i.e.:

    post_time += (current_time - post_time)/10
That way, if the post is really old, a vote gives it a decent boost, but if it's more recent, the effect is negligible.
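
A minimal runnable version of that trick might look like this (the field name and the 1/10 factor are just the ones from the snippet above):

    import time

    def bump_on_vote(post):
        """Nudge the post's effective submission time toward 'now' on every vote.

        Old posts get a big nudge (they resurface); fresh posts barely move."""
        now = time.time()
        post["post_time"] += (now - post["post_time"]) / 10

The bumped post_time is then fed into the normal hot-style ranking in place of the real submission time.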


Fascinating to see how HN and Reddit differ. I wonder if anyone can provide an analysis of which one is more appropriate for different scenarios, and why. I certainly have no clue.


Randall does a pretty good rundown of why the old comment ranking algorithm (which is similar to HN's) was bad here:

http://blog.reddit.com/2009/10/reddits-new-comment-sorting-s...

I wonder if pg would consider implementing the confidence sort on HN. It's certainly been effective on reddit.


I always like applying statistical theory to things like this: Bayesian weighting to estimate true average ratings, and logarithms for scores that grow exponentially. Applying suitable functions to transform values is a great way to fine-tune the measures we use on our sites and get exactly the sort of behavior we want.

Thanks for a great link.


One thing that wasn't mentioned is that high-scoring submissions are artificially deflated in their displayed score, so that the highest scores top out at around 3000 votes. This keeps the massive site seeming less gargantuan and makes your vote feel like it counts.


I think I might be missing something about the effect of submission time as described. t_s is a fixed value, and the post says, "The score won't decrease as time goes by, but newer stories will get a higher score than older." So why does the graph[1] show a decreasing score as "submission time ago" increases, with up- and down-votes fixed? t_s is fixed!

Maybe it's been too long a day. Can anyone explain?

[1]: http://amix.dk/uploads/reddit_score_time.png


The only thing I can think is that it means newer stories continuously start out with higher scores. So a top story last week might have scored 3000, an equivalently top story this week scores 3100, next week's will score 3200, and on to infinity.

Not sure if that's what it actually means or not, though.
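
If the hot formula from the article is taken at face value, that reading checks out (numbers for illustration only): the time term is seconds / 45000, so

    week_bonus = 7 * 86400 / 45000   # ~13.4 points

i.e. a story submitted one week later starts out roughly 13.4 points ahead of an otherwise identical story. The older story's score never decreases; the newer one simply starts higher and outranks it.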


I'm an idiot, I guess. I actually understood it all along. I recognised that the initial score for new submissions would be monotonically increasing, but the graph's x axis was throwing me by making it seem as though a single story's score was dropping over time, which isn't what it shows. If it were flipped horizontally, I think it would be more intuitive, since it might read as "these later (further-to-the-right) bars are newer stories" instead of stepping back in time. Or maybe not, and I'm just an idiot. :)


It's right here in the code, with this odd fixed timestamp:

    seconds = epoch_seconds(date) - 1134028003

which happens to be December 8, 2005 -- a few months after reddit began, and probably the date this scoring system went into effect.


I don't understand why newer stories should keep having higher and higher scores, instead of old stories having lower scores over time.

Actually, I'd rather have a consistent system that allowed fair comparison of stories irrespective of time, so that you could pick the "top ten stories of the year", for example.


A while ago I wrote a small npm module containing this algorithm, as well as a few other well-known ones, for anyone interested in a quick JS implementation. https://github.com/clux/decay


Ranking on sites like Slashdot, Reddit, and HN is a difficult and fascinating problem. I wasn't aware of the differences between Reddit's ranking methods, nor of Munroe's involvement. Thanks for the input!


If you mean difficult in the sense that it probably took some time to find the right algorithm, then you're probably right. But the algorithm itself is rather simple. I have looked at the reddit ranking algorithm and duplicated it for a project of mine. It took some iterations, but the result is rather simple.


To understand this, I translated it into Racket: https://gist.github.com/2214978


It is interesting, but I don't think the Wilson system, as implemented, is the best. If an article has over 200 comments, you'll (by default) only see the "best" 200 (max 500), so those will be more likely to receive the most votes. It's their default paging that is the problem.



