A data science investigation of IMDB, Rotten Tomatoes, Metacritic and Fandango (freecodecamp.org)
98 points by funspectre on Aug 27, 2017 | 38 comments



I have many problems with this analysis, perhaps enough to write my own blog post, but I am lazy so I will just outline my disagreements here (at least for now).

First: I claim that what we seek in a movie rating is information about whether we will like the movie, and that this can be formalized as the expected KL-divergence (information gain) between the Bayesian posterior distribution (probability of enjoying the movie conditional on its rating) and the prior distribution (probability you would enjoy a randomly selected movie). Of course, this will depend on your taste in movies, especially how much it correlates with others'. But we can _bound_ it by the Shannon entropy of the rating distribution: there is no way a rating can give us more information than this! It is this bound that lets us penalize distributions heavily piled onto one side of a discrete scale, like Fandango's. However, the "ideal" shape in this context is far from a Gaussian - it is uniform, since the uniform distribution maximizes entropy on a bounded scale! The uniform distribution can also be justified as being calibrated so that the quantile function is linear - a score of 90/100 from a uniform distribution means a 90th-percentile movie. Determining a quantile is a transform we often try to perform intuitively on ratings, so making that transform trivial seems useful.
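Rough sketch of the entropy bound, with made-up histograms (placeholders, not the article's data):

    import numpy as np

    def entropy_bits(p):
        # Shannon entropy of a discrete rating distribution, in bits.
        # This upper-bounds the expected information gain from seeing
        # a single rating drawn from that distribution.
        p = np.asarray(p, dtype=float)
        p = p / p.sum()
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    # Hypothetical histograms over a 5-bucket scale (illustrative only):
    fandango_like = [0.00, 0.02, 0.10, 0.40, 0.48]  # piled up on the high end
    uniform_like  = [0.20, 0.20, 0.20, 0.20, 0.20]  # maximum-entropy "ideal"

    print(entropy_bits(fandango_like))  # ~1.5 bits
    print(entropy_bits(uniform_like))   # log2(5) ~= 2.32 bits, the maximum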

Second: The Gaussian distribution does not have bounded support! That is, a rating scheme with what you claim is the "ideal" distribution would have _some_ ratings that are negative or otherwise "off the scale". Not so ideal! If you wanted to model movie-goodness on an unbounded scale such that a Gaussian would make sense, then you should transform that scale into a bounded one, e.g. with a logistic function, yielding an "ideal" shape of a logit-normal distribution, which incidentally can fit the strange bimodal Tomatometer distribution quite well. Even if you specifically wanted a unimodal, bell-shaped distribution, at least pick a bounded one like the beta distribution.
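Quick sketch of what the logistic transform buys you, with synthetic scores (not real site data):

    import numpy as np
    from scipy import stats
    from scipy.special import logit, expit

    # Synthetic 0-100 scores, squeezed into the open interval (0, 1).
    rng = np.random.default_rng(0)
    scores = np.clip(rng.normal(70, 20, size=5000), 1, 99) / 100.0

    # Fit a normal on the logit scale; mapping back with expit gives a
    # logit-normal on (0, 1), which stays inside the bounded rating scale
    # and can even turn bimodal when sigma is large.
    mu, sigma = stats.norm.fit(logit(scores))
    samples = expit(rng.normal(mu, sigma, size=5000)) * 100  # back to 0-100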

Third: setting aside which distribution you want to penalize distance from or why, dividing the space into three arbitrary intervals to facilitate the comparison seems ridiculous. There is already a perfectly good metric on probability distributions, the mutual information.


Along the lines you suggest, a while ago I took IMDB's ratings and used their empirical cumulative distribution function to "flatten" them into something more useful, percentile scores:

http://blog.moertel.com/posts/2006-01-17-mining-gold-from-th...

This was about a decade ago, so I'd expect the resulting decoder ring to be somewhat miscalibrated for today's movie ratings. But the same process would be straightforward to apply to a more up-to-date data set of ratings.
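For anyone who wants to rebuild that kind of decoder ring, the core of it is tiny (toy numbers below, not real ratings):

    import numpy as np
    from scipy.stats import rankdata

    def to_percentiles(ratings):
        # Empirical CDF: rank each rating against the whole set, then scale
        # to 0-100 so a score of 90 means "better than ~90% of movies".
        ranks = rankdata(ratings)              # average ranks, ties handled
        return 100.0 * (ranks - 0.5) / len(ratings)

    imdb_scores = [8.1, 6.4, 7.2, 5.0, 9.0, 6.4]   # toy data
    print(np.round(to_percentiles(imdb_scores), 1))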


I would read the shit out of this hypothetical blog post.


Please do write that post!


Another pitfall to be wary of in analyses like these is that imdb's ratings change over time. New releases typically have inflated scores that regress over time. Ideally, this sort of analysis shouldn't use anything under a year or two old, so using only movies from 2016 and 2017 puts this particular study off to a really bad start.


But isn't that the case for all of these websites, not just IMDB?


Rotten Tomatoes/Metacritic ratings don't change much after a month or so, unlike IMDB, where newer movies keep trending downward for at least a year or two.


I was disappointed by the 'data science' in this article: from selecting the shape of the distribution as a criterion for the quality of a metric, to using Pearson correlation with non-normal data, to using non-correlation with Fandango as the tie-breaking criterion, to failing to use external criteria (e.g. ticket sales) to validate or compare the metrics, and, to be picky, failing to discuss (for example) generalized linear models with link functions to deal with non-normal error distributions.
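On the Pearson point, a rank correlation would sidestep the normality assumption entirely; a sketch with made-up paired scores:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    # Made-up paired scores for the same movies on two sites (not real data).
    site_a = np.array([8.2, 6.9, 7.5, 5.1, 9.0, 6.4, 7.8])
    site_b = np.array([4.5, 3.9, 4.1, 3.0, 4.8, 3.7, 4.3])

    # Pearson assumes a linear relationship and is usually justified with
    # roughly normal data; Spearman only uses ranks, so it is monotone-
    # invariant and much less sensitive to skewed rating distributions.
    print(pearsonr(site_a, site_b))    # (r, p-value)
    print(spearmanr(site_a, site_b))   # (rho, p-value)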


I did a similar analysis a year and a half ago in which I found that yes, IMDb movie ratings are not uniform while RT/Metacritic's are, but that's only part of the story. All forms of movie ratings are actually poor predictors of box office success, especially with indies/documentaries; I'd love to look at rating/box-office data again while faceting by genre of movie. (Full blog post: http://minimaxir.com/2016/01/movie-revenue-ratings/)

The Four Point Scale (http://tvtropes.org/pmwiki/pmwiki.php/Main/FourPointScale), while a problem from a utilitarian point of view, is still practical from a consumer psychology point of view, which is why the popular ratings systems won't change easily.


I guess it doesn't surprise me that a fan rating site like IMDb skews high. While there are films that elicit strong negative reactions, on average I would think that people see and rate films that they expect to like. I know that there are a vast number of movies out there I would never expect to enjoy; mostly, I just ignore them. If I'm going to spend a couple hours watching something I almost certainly have at least neutral to mildly positive expectations going in.


So, the question I would ask is does Metacritic include reviews of movies that are not in IMDB, or vice-versa? That could definitely skew the scores.

The second question I would ask is whether there is a relatively simple transform that could make the IMDB and maybe even the Fandango scores more uniform in their distribution over the same set of movies.


Seriously doubt that Metacritic would have movies that IMDB would not, as Metacritic entries exist, ostensibly, when a movie is reviewable. IMDB has listings for every kind of movie project (in-development, pre-production).


I'm not sure that the justification for normality applies when considering that movies across different historical periods may be regarded as having a different average quality. Therefore the distribution across historical time would not be stationary.

That said, I've always preferred metacritic's scores over the others.


Fandango, Rotten Tomatoes, IMDB and Metacritic are all owned by content distributors/creators: Fandango and Rotten Tomatoes are owned by Comcast/Time Warner, Amazon owns IMDB, and CBS owns Metacritic.

It's no coincidence that the accuracy of ratings sites has deteriorated over the past few years. One of the most glaring violations of trust has been Rotten Tomatoes pre-certifying movies as "Fresh" and keeping them so despite aggregated reviews that would contradict a "Fresh" rating, along with the even more troubling trend of "Sponsored" movies like Step receiving the certification.

It should come as no surprise that the number of newly released Certified Fresh films is increasing even as the quality of films at the box office is decreasing.


Sadly... so much of what we get these days is manipulated. Can hardly wait to see how big business plans to tweak "AI" to further muddy the info we receive. /s

I usually first go to Wikipedia and get a summary of the critical reviews, length of film, some other stats. Head over to IMDB for a synopsis of the film (wikipedia doesn't do summary very well, focusing instead on entire plot). Maybe skim over a couple of user reviews (both positive and negative).

If it looks interesting, I'll head over to Youtube and watch the trailer. It's always the trailer that decides for me. Having watched plenty of films over the years, I pick up a tremendous amount of info from a 2-3 minute trailer.


You might as well watch the movie instead of doing all that then :-)


Honestly, it takes all of 5 minutes:) Saves me from throwing time and money away.


What do you mean by "pre-certifying movies as 'Fresh'"? They're pretty clear about the definition of "Certified Fresh":

> Movies and TV shows are Certified Fresh with a steady Tomatometer of 75% or higher after a set amount of reviews (80 for wide-release movies, 40 for limited-release movies, 20 for TV shows), including 5 reviews from Top Critics.

https://www.rottentomatoes.com/browse/cf-dvd-streaming-all/
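The quoted rule is mechanical enough to write down directly (a sketch of the stated criteria only; the function and threshold names are mine, and the "steady" requirement isn't captured):

    def certified_fresh(tomatometer, n_reviews, n_top_critic, release_type):
        # Encodes the quoted definition: >= 75% with enough reviews for the
        # release type, including at least 5 reviews from Top Critics.
        minimum = {"wide": 80, "limited": 40, "tv": 20}[release_type]
        return (tomatometer >= 75
                and n_reviews >= minimum
                and n_top_critic >= 5)

    print(certified_fresh(77, 120, 6, "wide"))   # True
    print(certified_fresh(73, 120, 6, "wide"))   # False, below 75%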


They can claim to use that algorithm; however, in practice, as of the last few years, that claim is extremely suspect, as we have no way to determine whether or not Rotten Tomatoes cherry-picked its reviews for the films.

Two glaring examples of "certified fresh" films loathed by audiences and top critics are Indiana Jones and the Crystal Skulls and the new Ghostbusters. Both have Audience scores in the low 50s and top critics scores in the low 60s, but "All Critics" scores are overwhelmingly positive enough to boost the score to just over 75% making them "Certified Fresh".

As far as "pre-certified" fresh, Rotten Tomatoes has taken upon themselves to sell Sponsored content and of those I've seen, Step and The Tick, as of right now, when the subject becomes Sponsored a wave of positive reviews is sure to follow.

https://www.rottentomatoes.com/browse/upcoming?minTomato=0&m...


> Two glaring examples of "certified fresh" films loathed by audiences and top critics are Indiana Jones and the Crystal Skulls and the new Ghostbusters. Both have Audience scores in the low 50s and top critics scores in the low 60s, but "All Critics" scores are overwhelmingly positive enough to boost the score to just over 75% making them "Certified Fresh".

Then what makes you think they cherry-picked their reviews? The Indiana Jones film has 260 reviews, and Ghostbusters has 325, both of which seem to be in the normal ballpark for huge wide releases. Indiana Jones is still at 77%, so it would still qualify as Certified Fresh today, while Ghostbusters is at 73%, just barely below the threshold.

I'm not seeing any reason to suspect foul play, and I don't understand why you would be upset. Audience scores and Top Critics are great pieces of information, but they're not how the Tomatometer or the Certified Fresh label work.


Because, baddox, that is what they literally do:

>Every day, a half-dozen Rotten Tomatoes staffers scour the web to find every review of every movie, collecting from major news outlets and well-known critics. They read each review and determine whether it is mostly positive or mostly negative.

>About half of the critics who appear on Rotten Tomatoes — often the more obscure set — submit their reviews, along with the ratings, to the site themselves. As reviews are indexed, Rotten Tomatoes calculates the score.

http://www.timescolonist.com/entertainment/movies/rotten-tom...

The notion that they cherry-picked reviews comes from the all-critics score being more than a standard deviation away from the Audience and Top Critics scores.

As far as your last statement goes, it does not appear that you actually understand how a review is determined to be Fresh or not for the Tomatometer.


You're conflating two very different concepts: the source of reviews, and the algorithm for calculating the Certified Fresh status. Obviously RottenTomatoes aggregates reviews from the Internet. That is the entire point, but that doesn't match the usual definition of "cherry-picking." To convince me that they cherry-pick reviews to achieve a desired Tomatometer or Certified Fresh result, I would need some evidence. The difference between Tomatometer, audience scores, and top critics is not sufficient evidence of foul play, because it's exactly what I would expect from a fair process.


"Last few years" and "Crystal Skull" are fairly contradictory, so what do you claim has deteriorated recently?


Cinephile here, and I cannot follow any ranking for a non-popular movie, and for blockbusters I can't either. I don't buy that you can have a good ranking for everyone. You should separate movies into different clusters. For example, Wonder Woman's score at Metacritic? 76. Are you kidding me?

BTW this is my shared ranking: https://docs.google.com/spreadsheets/d/1ojCTmnu8-uIXxnas142M...

EDIT: changed 7.6 to 76 based on the comment below.


Clearly the less mainstream your tastes are, the less useful a mainstream rating is going to be. A site can, as Netflix tries to do, adjust ratings based on expressed individual preferences but it's marginally effective in my experience.

For myself, it's fairly rare that I find myself way off from the critical consensus. I'm more likely to not care for big box office action films but, then, these aren't usually at the top of critics lists either.


The problem I see is that mainstream film scores are trivial to calculate and pretty unusable, because you just need to see the list of the few successful films worldwide. Then you have a lengthy list of films that should be weighted in a different way.


Worldwide box office definitely skews to big-budget action films though. Unless that's your thing, critic ratings are probably better.

Recommendations is a really tough problem even within a fairly narrow problem domain like film or music. I find Amazon and Netflix to mostly be pretty bad and you can be sure they've invested heavily.


The parent claimed there was no such thing as mainstream taste, basically.


And I'm not sure I agree.

Tastes do vary; I think it's fair to say that the average theatergoer has somewhat different tastes than I do.

But. If I read reviews of the critics at the major pubs and go to the movies they recommend, I may see some films that aren't to my tastes--just the sort of thing I KNOW isn't going to be my bag going in--and I may miss a few I'd really enjoy. But it's not a bad filter. And many of the people I know have relatively similar tastes.

Maybe that just means I'm of a similar demographic and have similar tastes to many critics. There does seem to be some consensus though.


The keyword here is hypocrisy. It's of course taboo to say one's own opinion is better. A value judgement is already entailed by voicing an opinion, so saying it is, at the least, redundant, but in the extreme, opinion is suppressed by the hypocritical "mainline" opposed to criticism.


Typo? Wonder Woman is 76 on a 0-100 scale.


Whenever this sort of analysis shows up on the homepage, I like to post a link to my favorite statistical movie-ranking site: http://www.phi-phenomenon.org/

I like it a lot and have found it valuable in figuring which movies to watch in my limited free time.


There must be platforms that try to do 'graph-matching' i.e. ask a new user to rate 25-30 popular movies. Compare against scores of other users who have rated the same movies. Identify a cluster of users who rate similarly; our new user is matched with this group. So, for other movies, the user can check the mean/modal score by members in this cluster. Sound familiar?


You are loosely describing collaborative filtering [1], a very common recommendation technique.

[1] https://en.wikipedia.org/wiki/Collaborative_filtering
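A minimal sketch of the user-based variant you're describing (toy ratings matrix, cosine similarity; all names and numbers are made up):

    import numpy as np

    # Rows are users, columns are movies; 0 means "hasn't rated it".
    ratings = np.array([
        [5, 4, 0, 1, 0],
        [4, 5, 0, 2, 1],
        [1, 0, 5, 4, 5],
        [0, 1, 4, 5, 4],
    ], dtype=float)

    new_user = np.array([5, 5, 0, 1, 0], dtype=float)

    def cosine(a, b):
        # Cosine similarity between two rating vectors.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    sims = np.array([cosine(new_user, row) for row in ratings])
    neighbors = sims.argsort()[::-1][:2]       # the 2 most similar users

    # Show the neighbors' mean score for each movie the new user hasn't rated.
    for movie in np.where(new_user == 0)[0]:
        print(movie, ratings[neighbors, movie].mean())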


Thanks for the tip. Is there any media recommendation engine that implements it?

Note though that I'm not talking about the engine itself recommending movies. It places you within a group and shows you the scoring data of like-minded reviewers.



Can anyone explain why he preferred a lower correlation with Fandango scores as a tie-breaker between Metacritic and IMDB? I couldn't follow the argument that a lower correlation shows Metacritic has more reliable results.


I'm pretty sure there is no sound explanation. If he wanted to pick the one that was most normal, he should have just measured quantitatively which distribution was closest to a normal distribution.
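Something like a goodness-of-fit statistic would make that comparison concrete; a sketch with placeholder samples (not the article's data):

    import numpy as np
    from scipy import stats

    # Placeholder rating samples per site, for illustration only.
    sites = {
        "imdb_like":     np.random.default_rng(1).normal(6.5, 1.0, 1000).clip(1, 10),
        "fandango_like": np.random.default_rng(2).normal(4.3, 0.5, 1000).clip(1, 5),
    }

    for name, scores in sites.items():
        # Kolmogorov-Smirnov distance to the best-fit normal: smaller means
        # more normal-looking. One number instead of three arbitrary buckets.
        mu, sigma = scores.mean(), scores.std()
        d, p = stats.kstest(scores, "norm", args=(mu, sigma))
        print(name, round(d, 3))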

The only rub in doing this would have been deciding what an "optimal" value for the standard deviation in a 10-point rating system is, which would have been an interesting discussion. To me, this is the most important question of this whole approach. If really big standard deviations are fine, then Rotten Tomatoes' uniform distribution might turn out to be the most "normal". But he seems to have totally glossed over this important fact, which is the real difference between IMDB and the other systems. With IMDB, the standard deviation of the scores is small compared to the range of possible scores. As a result, if a movie is rated 9.0, you know it's gonna be pretty damn great, and a high 9 or 10 would suggest this is one of the greatest movies ever made by some distance. That's the kind of information you can't get on a rating system where 5% of movies get 5 stars.
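To put rough numbers on that (hypothetical mean/sd values, not measured from any dataset): under a narrow, roughly bell-shaped distribution a 9.0 sits far out in the tail, while under a uniform spread over the full scale it's merely around the 89th percentile.

    from scipy.stats import norm, uniform

    # Hypothetical IMDB-like distribution: mean 6.3, sd 1.0 (illustrative only).
    print(norm(loc=6.3, scale=1.0).cdf(9.0))   # ~0.997 -> roughly top 0.3%

    # A uniform spread over the full 1-10 scale for comparison.
    print(uniform(loc=1, scale=9).cdf(9.0))    # ~0.89 -> only 89th percentile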



