A Bayesian view of Amazon resellers (2011) (johndcook.com)
114 points by DantesKite on June 20, 2023 | 36 comments



When I started learning about Bayesian statistics years ago, I was fascinated by the idea that a statistical procedure might take some data in a form like "94% positive out of 85,193 reviews, 98% positive out of 20,785 reviews, 99% positive out of 840 reviews" and give you an objective estimate of who is a more reliable seller. Unfortunately, over time it became clear that a magic bullet does not exist: in order for the procedure to give you an estimate of who is the better seller, YOU have to provide it with a rule for how to discount positive reviews based on their count (in the form of a prior). And if you try to cheat by encoding "I don't really have a good idea of how important the number of reviews is", the statistical procedure will (unsurprisingly) respond with "in that case, I don't really know how to re-rank them" :(


With that many reviews, any reasonable prior would have an infinitesimally small effect on the posterior. Assuming the Bernoulli model in the blog post, the posterior on the fraction f of good reviews is proportional to

  p(f|good = 85k*0.94, bad = 85k*0.06) ∝ f^79900*(1 - f)^5100 * f^<p_good = prior number of good reviews> * (1 - f)^<p_bad = prior number of bad reviews>
  ∝ f^(79900 + p_good)*(1 - f)^(5100 + p_bad)
Recall that the prior parameters mean that the confidence of your prior knowledge of f is equivalent to having observed p_good positive reviews and p_bad negative reviews. So unless the prior parameters are unreasonably strong (>>1000), any choice of p_good and p_bad will have negligible effect on the posterior.
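
To make that concrete, here is a minimal sketch (mine, not the article's) that computes the posterior for a few hypothetical prior pseudo-counts and shows how little they move the result at this sample size:

  # Minimal sketch: Beta posterior for the ~85k-review seller under several
  # hypothetical prior pseudo-counts (p_good, p_bad). Since Beta(a, b) has
  # density ∝ f^(a-1)(1-f)^(b-1), the posterior above is
  # Beta(79900 + p_good + 1, 5100 + p_bad + 1).
  from scipy.stats import beta

  good, bad = 79900, 5100
  for p_good, p_bad in [(1, 1), (50, 50), (900, 100)]:
      post = beta(good + p_good + 1, bad + p_bad + 1)
      print(p_good, p_bad, round(post.mean(), 4), round(post.std(), 5))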

The main reason Bayesian statistics is not a magic bullet is that it's up to you to interpret the posterior distribution. What does it really mean that the fraction of positive reviews from seller A is greater than the fraction for seller B with probability 0.713? What if it were 0.64? 0.93? That's for you to decide.
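
For what it's worth, that probability itself is easy to estimate once you have the two posteriors; what to do with it is the hard part. A rough sketch using the review counts from the comment above (seed and sample size are arbitrary):

  # Estimate P(f_B > f_A) by Monte Carlo from the two Beta posteriors
  # (seller A: 94% of 85,193 reviews; seller B: 98% of 20,785 reviews).
  import numpy as np

  rng = np.random.default_rng(0)
  f_a = rng.beta(1 + 0.94 * 85193, 1 + 0.06 * 85193, size=100_000)
  f_b = rng.beta(1 + 0.98 * 20785, 1 + 0.02 * 20785, size=100_000)
  print("P(f_B > f_A) ≈", (f_b > f_a).mean())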


Yes, two extra notes on this:

1. Weakly informative priors can be good to regularize and stabilize inference in scenarios with a low amount of data.

2. In the case of the review scenario presented by the OP, a hierarchical model could be even better, as it would achieve regularization (and shrinkage) by borrowing information across different sellers (sketch below).

In other words, one would learn the overall distribution of seller reliability at the same time as individual ones. This has two advantages: a) Just one (hyper)prior for the overall distribution, but no priors needed at seller level and b) seller predictions are pulled towards the mean in a principled way.

Most scientific publications that find unusually large effects from some association turn out to be false. If they used shrinkage, as in (2), they would avoid excessively optimistic predictions. See: https://en.wikipedia.org/wiki/Stein%27s_example
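
For anyone curious what (2) looks like in practice, here is a rough PyMC sketch of the partial-pooling idea; the model structure and hyperpriors are my own choices, and the counts are back-of-the-envelope numbers from the example upthread:

  # Hierarchical ("partial pooling") sketch: one hyperprior over seller
  # reliability, per-seller rates shrunk toward the population mean.
  import numpy as np
  import pymc as pm

  good  = np.array([80081, 20369, 832])    # ~94%, 98%, 99% positive
  total = np.array([85193, 20785, 840])

  with pm.Model():
      mu    = pm.Beta("mu", 2, 2)              # population mean reliability
      kappa = pm.Gamma("kappa", 2, 0.1)        # concentration = pooling strength
      f     = pm.Beta("f", mu * kappa, (1 - mu) * kappa, shape=3)
      pm.Binomial("obs", n=total, p=f, observed=good)
      idata = pm.sample()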


And then there's real life: it's better to look at the number of annulled negative reviews, if that stat is published. Positive reviews are gamed as standard practice. Negative reviews are bribed away. But the number of negatives removed is a good proxy for how bad the seller is.


If anyone wants to get more into Bayesian stats, I will always recommend Statistical Rethinking by Richard McElreath. Maybe my all-time favorite textbook. He has an accompanying YouTube lecture series as well.


I thought the book was so-so, but I find his lectures excellent. His second “season” of lectures has some amazing visualizations.


I've never seen a "theatrical trailer" for a lecture series before: https://youtu.be/BYUykHScxj8


What math prerequisites do you need to understand it? Are calc and linear algebra sufficient?


If you're looking for some Bayesian Stats books that require very little math/stats, I'd recommend starting with these two:

Think Bayes by Allen Downey: https://allendowney.github.io/ThinkBayes2/

Bayesian Methods for Hackers by Cam Davidson-Pilon: https://dataorigami.net/Probabilistic-Programming-and-Bayesi...

They're both an excellent warm up for Statistical Rethinking.


Really just Algebra II from high school, and Calc I if you want to go deeper. In all honesty if you have a solid handle on Algebra II and some intuition with geometry there is nothing in all of higher math that will be beyond your reach.


Is there an adaptation of Bayesian statistics that also takes into account the timeliness of the data? E.g. a more recent string of negative reviews would potentially indicate something compared to a smoother distribution.


Sure—you just use a likelihood function that doesn’t assume reviews are independently and identically distributed (where each review is an independent Bernoulli trial) but rather have some dependence on other reviews that are nearby in time. For example, a (Markov) switching autoregressive model, e.g. [0].

Everything else about Bayes’ rule (weighting likelihood by a prior distribution and re-normalizing to obtain a distribution over model parameters) applies just the same.

[0] https://www.statsmodels.org/dev/examples/notebooks/generated...
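
As a more self-contained illustration of "not i.i.d." (simpler than the switching autoregression in [0], and my own construction), you can put a persistent hidden good/bad state behind the Bernoulli reviews and evaluate the sequence with the standard forward algorithm:

  # Two-state Bernoulli HMM: the likelihood of a review now depends on the
  # (hidden) regime, which tends to persist from one review to the next.
  import numpy as np

  def switching_loglik(reviews, p_good=0.97, p_bad=0.7, stay=0.95):
      trans  = np.array([[stay, 1 - stay], [1 - stay, stay]])
      emit_p = np.array([p_good, p_bad])       # P(positive | regime)
      alpha  = np.array([0.5, 0.5])            # uniform initial regime
      loglik = 0.0
      for r in reviews:                        # scaled forward algorithm
          alpha = alpha @ trans
          alpha = alpha * (emit_p if r == 1 else 1 - emit_p)
          loglik += np.log(alpha.sum())
          alpha /= alpha.sum()
      return loglik

  # A late run of negatives is far more plausible under this model than the
  # same counts spread evenly through time.
  print(switching_loglik([1]*40 + [0]*10))
  print(switching_loglik(([1]*4 + [0]) * 10))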


There are two approaches:

1) A moving window: you only calculate your updates from the values in the window (say the last n reviews). The downside of this method is that older values drop off precipitously.

2) A forgetting factor: there are many possibilities but one simple one is the EWMA (exponentially weighted moving average). This is pretty standard, and takes the form

  Y_updated = alpha x Y_current + (1 - alpha) x Y_previous
with alpha in [0,1]. Applying this recursively, older values are down-weighted by successive factors of (1 - alpha). This is also known as exponential smoothing in time series. The advantage of this method is that older values are simply weighted less and drop out more gradually.
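
One way to plug this into the review problem (my own variant, with an arbitrary decay rate) is to apply the forgetting factor to the Beta pseudo-counts themselves, so old reviews fade out of the posterior:

  # Exponentially forgetting Beta update: decay both pseudo-counts before
  # adding each new review (1 = positive, 0 = negative), oldest first.
  def decayed_beta_mean(reviews, decay=0.99):
      good, bad = 1.0, 1.0                 # Beta(1,1) starting point
      for r in reviews:
          good, bad = decay * good, decay * bad
          good, bad = (good + 1, bad) if r else (good, bad + 1)
      return good / (good + bad)           # posterior mean of the positive rate

  print(decayed_beta_mean([1]*90 + [0]*10))   # recent negatives: lower estimate
  print(decayed_beta_mean([0]*10 + [1]*90))   # same counts, old negatives fade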


Two thoughts.

1. There is a vast literature on "Bayesian Change Point Detection." So you might find a point where something changed and a string of negative reviews began (toy sketch at the end of this comment).

https://dataorigami.net/Probabilistic-Programming-and-Bayesi...

2. Similarly, there are a bajillion ways to weight recent data. One way is to increase the gain on a Kalman filter. That will make recent observations more important. There are Bayesian implementations of the Kalman filter.

http://stefanosnikolaidis.net/course-files/CS545/Lecture6.pd...
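
A toy version of (1) for the review setting (my own sketch, not from the linked chapter): with uniform Beta(1,1) priors on the rate before and after the change and a uniform prior on the change location, the posterior over the change point has a closed form.

  # Posterior over the index where the positive-review rate changed.
  import numpy as np
  from scipy.special import betaln

  def changepoint_posterior(reviews):
      x, n = np.asarray(reviews), len(reviews)
      logp = np.empty(n - 1)
      for t in range(1, n):                # change happens just before index t
          g1, b1 = x[:t].sum(), t - x[:t].sum()
          g2, b2 = x[t:].sum(), (n - t) - x[t:].sum()
          # log marginal likelihood of each segment under a Beta(1,1) prior
          logp[t - 1] = betaln(g1 + 1, b1 + 1) + betaln(g2 + 1, b2 + 1)
      p = np.exp(logp - logp.max())
      return p / p.sum()

  post = changepoint_posterior([1]*60 + [0, 1, 1, 1, 1] + [0]*15)
  print("most likely change just before review", post.argmax() + 1)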


Is "Bayesian Change Point Detection" like https://xkcd.com/2701/?


Two things I like about Bayesian change point detection compared to "look at the graph":

1. It can provide a probability distribution. So after running Bayes, you might have multiple places where it's likely a change has taken place.

2. It generalizes to high dimensional space. It's much harder to tilt the graph in 5-d.


Bayesian analysis means you set up at least two models and evaluate the likelihood of the observed data under each. In this case rather than assume a fixed probability of good reviews, you might model a sudden switch (for example if an account was taken over or sold) or a gradual decline (for example if a manufacturer gradually lowers quality). Then you proceed as usual from that point. (I.e. find likelihoods and assign priors for each hypothesis/model, and use Bayes' rule.)
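
As a concrete (and simplified) version of that recipe, here is a sketch comparing the marginal likelihood of a constant-rate model against a single sudden-switch model, with uniform priors throughout; the data are made up:

  # Bayes factor: single sudden change vs. constant positive-review rate.
  import numpy as np
  from scipy.special import betaln, logsumexp

  def log_ml_constant(x):
      g = int(np.sum(x))
      return betaln(g + 1, len(x) - g + 1)   # Bernoulli likelihood integrated over f

  def log_ml_change(x):
      n = len(x)
      terms = [log_ml_constant(x[:t]) + log_ml_constant(x[t:]) for t in range(1, n)]
      return logsumexp(terms) - np.log(n - 1)   # uniform prior over change index

  x = [1]*70 + [0]*10
  print("log Bayes factor (change vs constant):",
        log_ml_change(x) - log_ml_constant(x))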


I think you would change the prior in that situation under the assumption that it indicates some sort of underlying shift. I'm not sure that there is any theory guiding how you do that though.


(2011) The old days when Amazon reviews carried information.


Yea, things get a bit more dire if you build an adversarial model where resellers are allowed to declare reputation bankruptcy.



Even more relevant, Evan Miller has an article on this exact topic (using Bayesian statistics to calculate ratings) that goes into further detail than the original article:

https://www.evanmiller.org/bayesian-average-ratings.html


I like Evan’s 2009 post a lot, but I like John’s analysis here even better. John seems to make fewer assumptions; in particular, Evan assumes a 95% confidence bound.


In Evan’s derivation, which uses the lower bound of the Wilson confidence interval on a binomial proportion, the confidence level is a parameter — you can replace it with any level desired. He just happened to use 1.96 for 95%.

John’s derivation is not based on the Wilson score but on a Bayesian update of a Beta distribution. They’re actually different algorithms. He starts with a Beta(1,1) prior and keeps updating. The advantage of John’s method is that you also get the variance, but to sort you then technically have to calculate differences between normal distributions, which is more involved (or you can ignore the variance and sort by the mean).

Both work for normalizing sort order so that small samples don’t skew the rankings. But as the sample sizes get larger, they both converge to the expectation by the law of large numbers.

I personally use the Wilson score method in my work and it’s easy to calculate and good enough for all practical purposes.
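
For anyone who wants to see the two sort keys side by side, a quick sketch (counts and confidence level are arbitrary):

  # Wilson lower bound vs. Beta(1,1)-posterior mean as ranking scores.
  import math

  def wilson_lower(good, n, z=1.96):
      if n == 0:
          return 0.0
      phat = good / n
      centre = phat + z*z / (2*n)
      margin = z * math.sqrt(phat*(1 - phat)/n + z*z/(4*n*n))
      return (centre - margin) / (1 + z*z/n)

  def beta_mean(good, n):
      return (good + 1) / (n + 2)            # mean of Beta(good+1, bad+1)

  for good, n in [(9, 10), (94, 100), (9400, 10000)]:
      print(n, round(wilson_lower(good, n), 3), round(beta_mean(good, n), 3))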


Can someone tell me the answer to the problem in plain english? I have no idea what a beta is.


It's a poorly written article that's almost completely useless for people actually interested in this type of problem. The Beta distribution models the uncertainty of a seller's positive ratio given their total count of positive reviews (a) and negative reviews (b).
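
To put numbers on that: if a hypothetical seller has 48 positive and 2 negative reviews, the model's "answer" is a distribution over their true positive ratio, which you can summarize as a point estimate plus a credible interval:

  # Beta posterior for a seller with a = 48 positive, b = 2 negative reviews.
  from scipy.stats import beta

  a, b = 48, 2
  post = beta(a + 1, b + 1)                  # uniform prior updated with the counts
  print("estimate:", round(post.mean(), 3),
        "95% credible interval:", [round(v, 3) for v in post.interval(0.95)])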


hmm... I might start discounting sellers with too many reviews. Not sure where the cutoff would be and some sellers might actually be high volume and get lots of reviews, but a huge number of reviews makes me think they are fake.


I think Amazon used to use the lower bound of a CI to sort? Or it used to be an option, then some sellers sued or threatened to based on the argument that it discriminated against smaller sellers?


A lot of words to describe Bayes' rule.


Am I the only one who finds it intuitive and simple? Every article seems to give it a whole chapter of explanation and vaunt it as the most groundbreaking concept ever. At first I thought that was just Rationalists sniffing their own farts, but I see it in a lot of places.


How would you summarize and convey it to be as simple and intuitive as you understand it?


Not contradicting you, but simple concepts are often the most groundbreaking. Finding a way to clearly simplify something complex is the hallmark of clear thinking (and Maths is laser-focused thinking).

IMO most people think of statistics in a Bayesian way, even if they don't know the notation or the science behind it. It is so intuitive. It is the frequentist approach, taught first in school, that feels alien and forced.


Sometimes that's necessary - a lot of Bayesian-stuff is very counter-intuitive.


Essentially this. There are many more variables involved in that specific example, such as time since last negative feedback, total age of account, feedback over the last 90 days, etc.


This and nothing else


needs 2011



