Bayesian ranking of items with up and downvotes or 5 star ratings (2015) (julesjacobs.github.io)
159 points by mooreds on Nov 16, 2017 | 24 comments



I recommend the approach described in this article:

http://www.evanmiller.org/ranking-items-with-star-ratings.ht...

In this formulation, s_k equals utility. Like the Wilson score formula (and unlike the linked article), the provided equation takes into account the variance of the expected utility.


I find that article very hard to follow -- there are lots of detailed formulas, but no obvious place where the prior distribution is discussed, or the utility score given to different star ratings. And the examples are all very abstract.

Edit to add: ah, I think I see, the utility of N stars is assumed to be N, and the prior is all ones. But aren't those the most important things to tune in a Bayesian model?
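
For what it's worth, here's my reading of that formula as code, assuming exactly that: the utility of k stars is k, the prior is one pretend vote per star level, and you rank by a lower confidence bound on the expected utility. Treat it as a sketch of my reading, not a transcription of the article (the z = 1.65, a ~95% one-sided bound, is my choice):

    import math

    def star_score(votes, z=1.65):
        """Approximate lower confidence bound on expected rating.
        votes[k] = number of (k+1)-star ratings; utility of k stars = k;
        prior = one pretend vote per star level (Dirichlet all-ones)."""
        K = len(votes)                            # number of star levels, e.g. 5
        N = sum(votes)
        p = [(n + 1) / (N + K) for n in votes]    # posterior mean of each level
        s = [k + 1 for k in range(K)]             # utility of each level
        mean = sum(sk * pk for sk, pk in zip(s, p))
        var = sum(sk * sk * pk for sk, pk in zip(s, p)) - mean * mean
        return mean - z * math.sqrt(var / (N + K + 1))

    print(star_score([0, 0, 0, 0, 10]))   # ten 5-star ratings: ~3.8
    print(star_score([0, 0, 0, 0, 1]))    # one 5-star rating:  ~2.4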


Another practical Bayesian approach that is much easier to understand and to productionize is described here: https://www.johndcook.com/blog/2011/09/27/bayesian-amazon/

It does assume a Beta(1,1) prior, however.
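
Roughly, the idea there (as I read it) is to put a uniform Beta(1,1) prior on each item's true positive rate and then compare the resulting posteriors directly, e.g. "what's the probability A is really better than B?". A quick Monte Carlo sketch of that comparison, with made-up numbers:

    import numpy as np

    def prob_a_beats_b(pos_a, neg_a, pos_b, neg_b, n=200_000, seed=0):
        """P(true positive rate of A > that of B), Beta(1,1) priors on both."""
        rng = np.random.default_rng(seed)
        a = rng.beta(pos_a + 1, neg_a + 1, n)   # posterior samples for A
        b = rng.beta(pos_b + 1, neg_b + 1, n)   # posterior samples for B
        return (a > b).mean()

    # 90 positive / 10 negative reviews vs. 2 positive / 0 negative
    print(prob_a_beats_b(90, 10, 2, 0))   # ~0.71 -- A is probably the safer bet

(The blog post works the probability out directly; sampling is just a lazy way to get the same kind of number.)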


With star ratings, I think an important point that often gets ignored is: different people use stars in different ways. One user might 5-star most things, but give the occasional 4- or 3-star review if they have a problem. But another user might 3-star by default, and save their 4- and 5-star reviews for exceptionally good cases.

I wonder if a simple way to fix that might be to reinterpret everyone's star ratings as percentiles, based on the overall distribution of stars in their reviews. "This user gives 5 stars 10% of the time, so we'll interpret a 5-star review from them as anything in the range 90-100 -- assume 95%."

You would probably also want to reinterpret the results for each user. "This item's review scores average out to 84%. For user A, that's 4.5 stars, but for user B, it's only 3.5 stars."
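
A rough sketch of the remapping I mean, with made-up data (each star value maps to the midpoint of the percentile band it occupies in that user's own history):

    from collections import Counter

    def percentile_map(user_ratings):
        """Map each star value a user has given to the midpoint of the
        percentile range it occupies in that user's rating history."""
        counts = Counter(user_ratings)
        total = len(user_ratings)
        mapping, below = {}, 0.0
        for stars in sorted(counts):
            frac = counts[stars] / total
            mapping[stars] = below + frac / 2    # midpoint of the band
            below += frac
        return mapping

    # a user who hands out 5 stars only 10% of the time
    print(percentile_map([5, 4, 4, 4, 3, 3, 3, 3, 2, 2]))
    # roughly {2: 0.1, 3: 0.4, 4: 0.75, 5: 0.95} -- their rare 5s read as ~95%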

The big downside is that star ratings become subjective. But they're already subjective, and ignoring that problem doesn't make the results any better. Average star ratings on all the big websites and app stores right now are garbage -- they'll usually warn you if some Amazon product is terrible, but that's about all.

If you crunch all the review data and figure out the best possible recommendations, you end up with collaborative filtering and the Netflix Prize. It's a shame that so much great work was done for that competition, but nobody seems to be using it now. Netflix themselves just use a trivial upvote scheme now.

But I wonder if there's some much simpler approach that still gets pretty good results.


Or even a simple thumbs up or thumbs down. Less open to interpretation on how the user uses stars. 1 star or 5 star basically.


I wrote this a couple of years ago [1]. I think we need to remove subjectivity in ratings by asking more specific questions and only allowing a binary answer.

1. Is the food good? 2. Is the service good? 3. Is the atmosphere good?

Those are pretty simple questions to answer. Often when I see 1-star reviews, it's because of a single element of the experience, not the overall experience.

It's easier to leave a review because there's less cognitive load. It's easier to search for what you want: if I have my foodie hat on, I don't particularly care about the service. If it's a night out with a customer, that becomes more important all of a sudden.

And then you can generate some sort of average score based on the answers to these questions to calculate the 5 star rating.

[1] https://medium.com/@acrooksie/no-more-5-star-rating-systems-...


I do prefer that over stars, but I think it potentially misses some information. Let's say most people answer "good" for all the categories. Does that just mean the place is good overall, or is it fantastic?

To put it another way, how do you distinguish the 4.0-star places from the 4.9-star places?

With conventional star ratings, you're reliant on most people using stars consistently. With a series of yes/no questions, you're relying on a potentially small pool of "no" answers to give you a useful signal.

I think stack ranking would be much more powerful. "How does this place compare to others? Average, better than average, in your all time top 5?" Everybody's feedback would be completely clear. It's not obvious how to aggregate that into a single rating number though.


Given a set of questions - e.g. "how's the food", "how's the atmosphere", "how's the service", etc. - you could figure out how the restaurant scores relative to others by stack ranking based on the % of answers to a particular question that got a "Yes". The numbers should hopefully reflect a normal distribution, and from there you get your /5 rating.

If everybody answers "yes" to all of the questions - good value, service, food, atmosphere - then that suggests to me that it's a great restaurant. And you can have a lot of questions that are even asked randomly to limit the number of questions per user.
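
To make the stack ranking concrete, something like this (toy sketch -- it uses a straight percentile rank rather than fitting a normal distribution, and real data would need minimum sample sizes):

    def question_stars(yes_rates):
        """yes_rates: {restaurant: fraction of 'yes' answers to one question}.
        Rank restaurants by yes-rate and scale the percentile to a /5 score."""
        ordered = sorted(yes_rates, key=yes_rates.get)
        n = len(ordered)
        return {r: 5 * (i + 1) / n for i, r in enumerate(ordered)}

    print(question_stars({"Taco X": 0.97, "Taco Y": 0.88, "Burger Z": 0.64, "Pizza W": 0.75}))
    # Burger Z: 1.25, Pizza W: 2.5, Taco Y: 3.75, Taco X: 5.0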

I rate a lot of places highly that have a lot of great things going for them but not great service, because I don't think the service is bad enough to bring it down. But that's data that is being lost.

I like your idea of stack ranking but with a different flavour. I think that "in your all time top 5" is a hard question to answer. How about this though - if we know you've been to Taco Place X and now you're going to Taco Place Y, maybe the question is "are the tacos at Y better than X", "is the atmosphere at Y better than X" or even "is Y better than X" (but I like the idea of collecting more granular data).

If you collect this^ data to stack rank, then it definitely gives you a better distribution of restaurants relative to each other in each category.

As a consumer, with this level of granularity, I can select what I care about tonight. If I'm grabbing takeout for lunch at work, does a five star rating even matter? I should ask Siri "show me the top fast and delicious takeout restaurants near me" and she should do: "select name from restaurants where distance < 500m order by (speed + flavour) limit 3;" and from there I will pick something from that list that looks nice. That seems like a nice UX.


There's a body of research on this, and it suggests that ratings are more meaningful if you add options, up to about 5 or 6 options.

That is, if you asked people to do the ratings once, and then asked them 1 hour later, there would be more consistency across time as you add options from 2 to 3 to 4, up to about 5 or 6.

The problem with binary ratings is that, as much as you might think otherwise, you're forcing a kind of hazy, grey experiential assessment into 0 or 1. And in doing so, people near the boundary (whatever that might be) will vacillate between them. E.g., people who feel "meh" about something are forced to choose one or the other, and sometimes they'll say 0 and sometimes 1. The more options you give, the more reliable / meaningful the ratings will be.

This example is interesting to me because it's something most people can relate to and illustrates the complications of utility-based and Bayesian formulations of the problem. You end up having to decide on utilities and/or priors.

To me the answer is to weight the data maximally in forming a posterior, in which case you end up using a reference prior. Similar kinds of arguments about utilities lead to reference priors. Reference priors can be complicated to compute, but for things like multinomials over ordinal ratings, reference priors have been worked out fairly well.

To me it always made sense to allow people to sort by the center of the estimate, or the lower bound (maybe using different language).
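
For the plain up/downvote case the reference prior coincides with the Jeffreys prior, Beta(1/2, 1/2), so "center of the estimate" vs "lower bound" would look something like this sketch (ordinal star ratings need the multinomial machinery, which is messier):

    from scipy.stats import beta

    def center_and_lower(up, down, a=0.5, b=0.5, q=0.05):
        """Posterior mean ('center') and q-quantile lower bound for the
        upvote fraction under a Beta(a, b) prior; a = b = 0.5 is the
        Jeffreys/reference prior for a binomial proportion."""
        post = beta(up + a, down + b)
        return post.mean(), post.ppf(q)

    for up, down in [(5, 0), (90, 10)]:
        center, lower = center_and_lower(up, down)
        print(up, down, round(center, 2), round(lower, 2))
    # both items have a high center; the 5-vote item gets a much lower lower bound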


Slight tangent--

I think 1-4 stars is the ideal rating style. I wish that were used more often.

A choice of 1-4 stars gives you enough freedom to express your opinion, without being overwhelming. It's a small enough range to be reasonably objective (almost everybody will interpret it as 1 star = bad, 2 = passable, 3 = good, 4 = great). And with an even number of choices there's no middle "meh" option -- you're forced to make a choice between 2 and 3.

Of course it's important not to ruin it by adding extra options, like 0 stars or half-stars. (That was Ebert's big mistake!)

Edit to add: to relate this to the parent post, I'm thinking that maybe ranking things as 1-4 stars in several categories could be the best of both worlds.


Evan, I'm confused. The author refers to two articles by you and here is a third. Can you comment on where they come from, and in what order?


Not a statistician, but this still seems flawed. The pretend votes need to be related to the person seeing the list of items. These normally come from the population (i.e. if you were ranking Netflix, the pretend votes would be the sum of all votes that exist for every movie, grouped by star count). This makes sense, because if you had no other information, your guess would just be the average of all the existing ratings.

The problem is that the pretend votes need to be culled in order to be predictive. Otherwise they dominate in the arithmetic. They need to be more specific to the user looking at the ranking. Continuing with the Netflix example, if a user was looking for scary movies, the pretend votes need to come from the corpus of all scary movies, rather than all movies that exist.

Here's the problem: there doesn't seem to be a good way to narrow the pretend votes. Worse, there isn't a good way to combine the two. If the pretend votes came from two sources, it's not clear what to do. For example, if the user is from California, the California pretend votes (priors?) need to be combined with the scary movie pretend votes.

How can we add pretend votes without justifying where they came from?


It doesn't have to be correct, just a plausible starting point. The "pretend votes" have less importance as more real votes come in.

I do think this article suggests adding too many pretend votes. Without the kind of justification you're talking about, it's usually better to add only a couple (reflecting low confidence in the prior).
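
A tiny sketch of what I mean, with a made-up prior mean of 3.5 stars and two pretend votes:

    def smoothed_average(ratings, prior_mean=3.5, pretend_votes=2):
        """Average rating shrunk toward a prior mean by a few pretend votes."""
        return (prior_mean * pretend_votes + sum(ratings)) / (pretend_votes + len(ratings))

    print(smoothed_average([5]))         # 4.0  -- one 5-star review barely moves it
    print(smoothed_average([5] * 100))   # ~4.97 -- real votes swamp the prior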


I'm just getting into stats, but the way I see it, a new item has 0 votes with an error bar +/- the number of possible votes. Each sorting should then include a randomization factor related to the error bar, and so randomly promote some new items into top rankings so they get some exposure to gather votes. As they accumulate votes, the error bar shrinks as the ranking becomes a little more certain.
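
Something like this, maybe (I gather this is roughly what gets called Thompson sampling): each time the list is rendered, sample a plausible score from each item's current uncertainty and sort by the samples. A rough up/downvote sketch, assuming a uniform Beta(1,1) starting point:

    import random

    def sampled_ranking(items, seed=None):
        """items: {name: (upvotes, downvotes)}.  Draw a plausible upvote rate
        from each item's Beta posterior and sort by the draws; uncertain new
        items occasionally get promoted, well-established items stay stable."""
        rng = random.Random(seed)
        draws = {name: rng.betavariate(up + 1, down + 1)
                 for name, (up, down) in items.items()}
        return sorted(draws, key=draws.get, reverse=True)

    items = {"old favourite": (400, 100), "brand new": (1, 0), "clunker": (5, 20)}
    print(sampled_ranking(items))   # order varies per call; "brand new" sometimes wins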


I believe the right way to think about it isn't error bars, but the entire probability distribution -- what's the probability that if everybody voted, the upvote/downvote ratio would be 75/25, 80/20, 85/15, etc. Once you've figured out the probability distribution, you can calculate error bars any way you like (e.g. 95% confidence interval).

The beta distribution is one model you can choose for that probability distribution, which happens to have some nice properties that make it easy to work with.

The other question is, what's the "zero knowledge" probability distribution? I think your "0 votes with an error bar +/- the number of possible votes" would translate to "uniform probability of any result", which I think is beta(1,1).

Depending on the scenario, though, you might look at the data and observe that extreme values are very uncommon, and therefore start with something like beta(2,2) instead (a bell curve rather than a flat distribution). That has minimal impact once you have lots of real upvote/downvote data, but it makes a huge difference to how the first few votes are interpreted.
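
For example, here's how the posterior mean (expected upvote fraction) moves under each prior -- the difference only matters for the first few votes:

    def posterior_mean(up, down, prior_up, prior_down):
        """Expected upvote fraction under a Beta(prior_up, prior_down) prior."""
        return (up + prior_up) / (up + down + prior_up + prior_down)

    for up, down in [(0, 0), (1, 0), (3, 0), (100, 25)]:
        flat = posterior_mean(up, down, 1, 1)   # beta(1,1): uniform
        bell = posterior_mean(up, down, 2, 2)   # beta(2,2): extremes unlikely
        print(f"{up:>3} up {down:>3} down   beta(1,1): {flat:.2f}   beta(2,2): {bell:.2f}")
    # 1 up / 0 down: 0.67 vs 0.60; by 100 up / 25 down both are ~0.80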


Right, that sounds more like what I meant. Still familiarizing myself with the terminology, thanks!


I like to think about it this way. By not explicitly imposing a prior, you are implicitly imposing a prior that each item will receive no votes. This is totally nonsensical, because of course these items will get votes.

Just because we don't know what the true value of p will be doesn't mean we don't have some expectation. If I asked you what you expect the popularity of a given item to be, you wouldn't say 0; you'd say something like the average. So why assume all items will have 0 votes in our model?



Anyone here have thoughts on why, all these years later, Amazon still doesn't have a sort option along the lines of these proposals? It seems like such an easy win and an easy technical change. Do they have some business reason not to change their default sort?


I'm not sure what you mean -- could you elaborate?

Amazon probably doesn't use straight score averaging to decide its "best" items sort, and these are just proposals for how to make that better by not just using averages. So what is it you're looking for Amazon to add?

Disclaimer: work at Amazon, not on anything search related.


Amazon has the default "Featured" sort (I'm not sure what is behind this, but it intuitively seems like some combination of popularity + availability + rating). If this default doesn't fit your needs, your only option is to change to sort by "Avg. Customer Review", which gets you a list that is sorted by average rating regardless of the number of reviews. Evan called this out nearly 10 years ago in the post that OP's article mentioned - http://www.evanmiller.org/how-not-to-sort-by-average-rating..... The root problem is that one random obscure product with a single 5-star rating outranks something with 499 5-star ratings and 1 4-star rating.

I'm often looking for what is the best/highest-quality item in a category, meaning I want not just a high average, but a high average that is statistically meaningful. I'm just surprised Amazon hasn't offered a way to do that (and have read umpteen threads on HN in the past years expressing the same frustration).
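
For what it's worth, the fix in Evan's old post (the Wilson score lower bound, for binary positive/negative data; the star-rating version is the other article upthread) is only a few lines, something like this if I've transcribed the formula right:

    import math

    def wilson_lower_bound(pos, total, z=1.96):
        """Lower bound of the Wilson score 95% interval for the fraction of
        positive ratings."""
        if total == 0:
            return 0.0
        phat = pos / total
        centre = phat + z * z / (2 * total)
        spread = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
        return (centre - spread) / (1 + z * z / total)

    print(wilson_lower_bound(1, 1))       # ~0.21: one perfect rating means little
    print(wilson_lower_bound(499, 500))   # ~0.99: lots of near-perfect ratings

Sorting by that (or any of the Bayesian variants in this thread) is the kind of option I'd like to have.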


....Default 'featured' sort?

When I go to Amazon.com and search, I see 'relevance' as my default, with 'featured', some price-related ones, 'average', and 'new' as options. ('Featured' only seems to exist on some products, and to be related to ads.)

Is it not the same for you?

-----

As for your main point (because I think that your complaint is still valid even with 'relevance' as the default), it sounds like what you want is a way to choose what factors are applied to your sort.

I'm not sure, but it seems likely that 'relevance' is doing more than just averaging, and so being able to select which parts you apply (e.g., only use a statistical notion of best, don't consider availability or shipping times) would cover your use case, right?

Well, you might want to be able to choose between a few models of 'best', but the real issue, the core need, is that you want control over the model that Amazon is using to sort what you see and to have some input on what that looks like. (And not just have 'lolsux' or 'Amznsort', to be a little glib.)

Gotta say, that actually sounds like a pretty reasonable ask. I'm not sure why it doesn't work that way, either.


Yeah, my above comment was not using a text search, hence no "relevant" option (i.e. if you just drilled down the department hierarchy to, say, the TVs department).

> the core need, is that you want control over the model that Amazon is using to sort what you see and to have some input on what that looks like

Indeed, but I'm not even looking to have that much granular control over it. I just want "sort by rating, but toss out all the obscure crap that has 1 or 2 ratings, because that rating is meaningless."


Unfortunately, no matter how you twist and turn it, ordinal data is going to stay ordinal. You can't make it meaningful, regardless of how you aggregate it.



