Why I’m Not a Fan of R-Squared (johnmyleswhite.com)
77 points by sndean on July 24, 2016 | 51 comments



"As such, R^2 answers the question: 'does my model perform better than a constant model?' But we often would like to answer a very different question: 'does my model perform worse than the true model?' "

Maybe I'm over-generalizing here, but I think this fundamental assumption is untrue for most people who use statistical models to solve actual problems. All of the "undesirable" behavior of the R^2 metric makes complete sense when you view it as a comparison with the most naive model (the constant model), and while R^2 certainly doesn't measure how close to "true" a model is, it very well captures the utility of using a more sophisticated model over an extremely simple one, which (I believe) is a critical question.

For example, if you had to predict a process where measurement noise overwhelms variation in the process itself, as in his first example of log(x) from .99 to 1.0, then the constant model is pretty much the best you can do, and both log(x) itself and the linear model offer little additional benefit. Getting low R^2 values for those two models makes total sense--they offer no marginal benefit. In fact, if you had to make the decision "should I use log(x) or the constant model?", you're often better off going with the constant model for simplicity and predictability (unless you have domain knowledge motivating a different choice).
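As a rough sketch of what I mean (an R simulation of my own; the sample size and noise level are guesses, not the post's exact setup):

  set.seed(1)
  x <- seq(0.99, 1.0, length.out = 200)
  y <- log(x) + rnorm(length(x), sd = 0.1)   # measurement noise swamps the tiny variation in log(x)
  summary(lm(y ~ log(x)))$r.squared          # near zero: almost no gain over the constant model

Whether log(x) or the constant model "wins" on a sample like this is close to a coin flip, which is exactly the information R^2 is conveying.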

I like well thought-out articles like this one on statistical concepts because they make me think about things I often take for granted, and while treating R^2 like a better version of RMSE or MAD is clearly wrong, it often better captures things people actually care about by taking the difficulty of the problem into account. If you're doing advanced statistical modeling, it's easy to underestimate how common it is for a beginner to celebrate getting (say) 96% classification accuracy on a 2-class problem where one class makes up more than 95% of the samples--an issue that using R^2 can quickly reveal.

tldr: Awesome article but (imo) R^2 is more useful than most other metrics, not less


    > “does my model perform worse than the true model?”
What is "true model"? I can't make head nor tail of that term. I've never heard this before, nor does it make sense to me when I take just the word meaning.


It's a bit of a philosophical thing. The data arose as the result of some large set of processes, all unobserved. Assuming a deterministic world—just for convenience of argument here—there's no actual underlying distribution which caused the data. Just a big equation with too many unknowns.

In practice, though, these systems are often well-described by distributions. In common parlance for statistics, people thus sometimes like to talk about that large set of unknowns as though it really were just some arbitrarily complex distribution which we don't know: the true model.

From a frequentist perspective, the idea of the true model is very important. Many arguments arise by considering it and trying to understand how to reduce the distance between your estimators and the unknown truth.

Bayesian foundations avoid this entirely, refusing to posit an unknown and instead talking about refinements of what we know.


It's not just philosophical -- there are examples of practical situations where there is a true model (especially in more traditional applications of statistics):

- Polling for a presidential election: The "true model" is the voting preferences of all 300 million Americans. A (uniformly) random sample of N Americans can be used to estimate the true model with a standard error on the order of 1/sqrt(N).

- Particle physics: The "true model" is the decay probabilities of various particles as computed by quantum mechanics. Different models (and parameters) yield different decay probabilities, and experiments can be used to choose between models and/or estimate model parameters.

Of course, oftentimes the true model is philosophical as you describe.


That's reasonably fair, but it's worth noting that even in these situations the true circumstance might be a little more nuanced than the idea of a "true model" suggests.

- In polling, it's a bit of an idealization to think that the true voting preferences of all Americans are (1) fixed, (2) consistently measurable, or (3) even relevant given a lot of people won't vote. These are all sources of unknowns and variances which make me pretty unhappy with the idea of a true model even here.

- In physics there's definitely a notion of a true model, taken under the assumption that one set of equations and models is "correct" (which is its own philosophical problem!), but even then experiments don't measure this true model perfectly. They also rely on corrections for variance in measurement and tooling that is assumed to be eventually ignorable. In any case, this has a much more definite notion of a true model, but I still find it difficult to swallow altogether.


> (3) even relevant given a lot of people won't vote

This bit seems trivially resolved by interpreting "voting preferences" to include likelihood of voting.


You can definitely model that, too, but it leads you even further away from being able to say that the "true model" is a physical thing of any nature.


Ah, yeah, fair point - that does move us away from the physical.


The true model is the probability distribution that generates the observed data.


In other words, what was actually observed?


Not really. What is observed is data, not a model/distribution. One can fit a given model to the data, but the data do not, on their own, tell you which model generated them.


Maybe you've gotten it already, but I don't know. Here is a possibly overused coin-flipping example:

Say you have a coin that might be unfair, and you want to estimate its bias. You flip it a bunch of times, and it mostly lands heads.

Predictions of the coin's bias based on this observed data are usually going to be that it's biased some % towards heads. (unless maybe you have a strong prior, but that's a different topic)

But there is also a chance that really the coin is fair, or even biased towards tails and you just got unlucky in your flips.

There's a mismatch between the true model (the coin's actual bias) and the observed data (the result of your flips) because of this chance of "unluckiness".


No. "true model", "true distribution" and "true population" is what generates the data.


Which is a misnomer, because the data probably isn't generated by a model.


Semantics at this point. Data-generating process is a term that's also used. A model seeks to mimic or match the real process to a reasonable approximation. Hence "true model" as a scoring engine or data relationship that represents reality completely.


Except that we can't access reality completely. Assuming that there is a "true model" generating data is just that: an assumption with no real basis.

Semantics is the essence of communication.


Unless you have data, that is. The data is the basis for assuming that a process has generated data. Either that, or the data has existed for all eternity, and therefore could never have been collected.


"data" is your own perception, however. Can you give an example of a data generating process?


We can create them!

Suppose I take the function y = log(x) and add random white noise. The function log() and the parameters on the random white noise process are the data generating process. We could then fit a model y = \beta X + \epsilon, and then compare the "true" (first) model to our second model. When the natural world generates our data, the idea behind all this is the same: there is a process which generates the data, and the data reveals information about that process to an approximate degree.
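A minimal sketch of that setup in R (the x-range and noise level are arbitrary choices of mine):

  set.seed(42)
  x <- runif(500, 1, 10)
  y <- log(x) + rnorm(500, sd = 0.2)   # the data generating process: log(x) plus white noise
  fit <- lm(y ~ x)                     # a linear approximation to that process
  mean((y - fitted(fit))^2)            # MSE of the fitted linear model
  mean((y - log(x))^2)                 # MSE of the true model: only computable because we wrote the DGP

With natural data we never get to evaluate that last line; the generating process is exactly the thing we don't know.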

Some reads:

[1] http://www.rimini.unibo.it/fanelli/econometric_models2_2012....

[2] https://en.wikipedia.org/wiki/Data_generating_process


Can you give a non-synthetic, ie. natural, example data generation process?

Edit: I don't know if people don't like the grammar, or what?

How about this:

Can you give a non-synthetic, ie. natural, example of a data generating process?


Sure! Most of the equations you find in your nearest physics or chemistry books are validated experimentally/empirically. The validation comes from better and better approximation of the data generating process.


The stochastic process that generated the observed data.


No, the "true model" he seems to be referring to is the mean of the distribution of the noisy observations. In the example, the distribution is the normal distribution and the mean is log(x), which he is referring to as the "true model".

The notation is pretty sloppy for someone handing out statistical advice.


I don't think his usage of these terms is sloppy at all.

The main point of his argument is that R² compares the model to the observed mean, not to the true model behind the data.

Terms "true model", "true distribution" and "true population" are well defined concepts in statistics – unless Jorma Rissanen is nearby.


The true model is

  y = log(x) + u
where u is an error/noise term, which in this case follows a normal distribution. The error term is meant to hoover up all the stuff you can't account for, because it's impossible to measure everything. As ever, what you're trying to estimate are the parameters for everything but the error term.

He's maybe being a little sloppy with the notation, but TBH it's a pedantic distinction that I haven't seen anyone make a big deal of since I last took intro stats. In R, for example, you'd specify the model as "y ~ log(x)" and leave it at that.
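For instance, in a toy simulation (numbers invented):

  x <- runif(100, 1, 10)
  u <- rnorm(100, sd = 0.5)    # the error term hoovering up what we can't measure
  y <- log(x) + u              # the true model generating the data
  coef(lm(y ~ log(x)))         # u stays implicit; intercept and slope should come out near 0 and 1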


In the example, log(x) is the true model. It is generally unknown.


I'm a huge fan of R^2 and you should be too.

The simple way to think about R^2 is that it is a measure of relative predictive accuracy (and that is exactly how it is calculated). This is both a more accurate and a more useful definition (for most tasks the HN crowd would work on) than saying R^2 is a measure of the distance from the true model.
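Spelled out, the calculation is

  R^2 = 1 - SSE(your model) / SSE(constant model that always predicts the mean)

i.e. the reduction in squared error relative to the most naive baseline.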

All the figures and findings in the post are completely reasonable given that definition (the linear model performing better in the first case is due to the high variance of the author's data; other samplings would lead to the reverse result).


I think what John is saying is that people commonly use R^2 as a measure of model fit where something like root mean squared error (RMSE) gives a better measure of model fit (by measuring the distance from the true model), depending on the model. Just using R^2 blindly for most tasks you would work on can lead to choosing an incorrect model.

I think the main take-away from the post is to better understand the correct measure of model fit for your specific data. For example, if you are forecasting a time series with stationary demand, mean absolute deviation (MAD) might be the best measure of model fit, but in the case where there is seasonality with a trend, the RMSE would be a better measure [1].
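For reference, the standard definitions of those two error measures (with yhat the model's predictions):

  RMSE = sqrt(mean((y - yhat)^2))
  MAD  = mean(abs(y - yhat))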

[1] http://robjhyndman.com/papers/foresight.pdf


> I think what John is saying is that people commonly use R^2 as a measure of model fit where something like root mean squared error (RMSE) gives a better measure of model fit (by measuring the distance from the true model)

I don't mean to be rude, but that is definitely not what he is saying. There are two important things I'd like to clarify:

- It is wrong to call the alternative measure "the RMSE". The alternative that the article was proposing was a made-up measure called E^2 which measures the distance from the true model.

- The author is not suggesting that people use E^2 instead of R^2 for any case. In fact, in almost all cases it is impossible to use E^2, because calculating it requires you to know the true model, and if you know the true model it would be very unlikely that you'd be wasting your time measuring other models.

The author makes it clear that E^2 isn't really to be considered an alternative when he called it the "generally unmeasurable E^2".


I think we are saying the same thing in different words, and I might be confusing "an alternative" with "a comparison". John compares R^2 with E^2, but RMSE can be considered an alternative to using R^2 in certain cases.

If you go back to the first line:

> People sometimes use R^2 as their preferred measure of model fit.

I think the post is going over why R^2 is not recommended, as R^2 is not only a measure of the error but also includes a comparison with a constant model. John defines E^2 as a comparison metric which measures how much worse the errors are than if you used the true model.

Going back to a metric for determining model fit, RMSE/MSE/MAD are all alternative measures of model fit and are useful depending on the dataset.


An interesting tangential argument I had with a colleague recently was how to actually calculate R^2 when testing model predictions on new data (emphasis on new data).

He claimed the conventional way is to take the square of the correlation coefficient between the two lists of data (predictions and observations), whereas I was suggesting 1 - SSE/SST, in accord with references I read (well, Wikipedia); the latter method yields lower values for his data. I know the two are equal on training data when using linear regression with a constant term, but in general they differ. The fact that the correlation approach is scale and shift invariant disqualifies it, to my mind, as a measure for predicting new data.

Unfortunately he was able to barrage me with many online references that just say that "R^2 is the square of the correlation coefficient", which without very careful reading of the context (fitting not predicting), and sometimes even with such careful reading, makes his interpretation look correct. I find the whole thing rather exasperating...

It also occurs to me that, purely as a matter of convention, I may be wrong, which wouldn't surprise me as I'm new to modeling.

So: Do most modelers report correlation^2 as their R^2 values for holdout tests? Have other modelers here encountered this confusion?
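For concreteness, here's a toy R example (numbers made up) of how the two can diverge on new data when the model is systematically off:

  set.seed(7)
  x_tr <- runif(100); y_tr <- 2 * x_tr + rnorm(100, sd = 0.2)
  fit  <- lm(y_tr ~ x_tr)
  # holdout data with a level shift the model never saw
  x_te <- runif(50); y_te <- 2 * x_te + 1 + rnorm(50, sd = 0.2)
  pred <- predict(fit, newdata = data.frame(x_tr = x_te))
  cor(pred, y_te)^2                                       # still high: invariant to the shift
  1 - sum((y_te - pred)^2) / sum((y_te - mean(y_te))^2)   # much lower, can even go negative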


I find this very confusing, but I guess I'm not the intended audience. Not that I'm saying it's wrong; I just don't really see the point.

Do people really expect R^2 to measure the fit of the model to the true model? R^2 measures the fit of the model to the data: i.e. how well the model performs at predicting the outcomes. In his first example it is clear that all the models are equally useless: the noise dominates and the predictive power of the models is close to zero. In the second example the predictive power of all the models has improved, because there is a clear trend. The true model predicts much better than the others now, but each model predicts better than in the previous example.

In the first example, he concludes: "Even though R^2 suggests our model is not very good, E^2 tells us that our model is close to perfect over the range of x."

Actually our model is "better than perfect". The R^2 for the linear model (0.0073) and for the quadratic model (0.0084) is slightly better than for the true model (0.0064). Of course this is not a problem specific to the R^2 measure (the MSE for the linear and quadratic fits is lower than for the true generating function) and can be explained by the linear and quadratic models overfitting. E^2 is essentially the ratio of the 1-R^2 values (minus one). We get -0.00083 and -0.00193 for the linear and quadratic models respectively (the ratios before subtracting one are 0.9992 and 0.9981).
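Spelling out that relationship (as I read the post's definition):

  E^2 = (1 - R^2_model) / (1 - R^2_true) - 1
      = MSE_model / MSE_true - 1

since 1 - R^2 = MSE / Var(y) and the Var(y) factors cancel.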

In the second example,"visual inspection makes it clear that the linear model and quadratic models are both systematically inaccurate, but their values of R^2 have gone up substantially: R^2=0.760 for the linear model and R^2=0.997 for the true model. In contrast, E^2=85.582 for the linear model, indicating that this data set provides substantial evidence that the linear model is worse than the true model."

The R^2 already indicates that the linear model (R^2=0.760) and the quadratic model (R^2=0.898) are worse than the true model (R^2=0.997). The fractions of unexplained variance are 0.240, 0.102 and 0.003 respectively, and it's clear that the last one performs much better than the others even before we take the ratios and subtract one to calculate the E^2 values of 85.6 and 35.7 for the linear and quadratic models respectively.

(By the way: "we’ll work with an alternative R^2 calculation that ignores corrections for the number of regressors in a model." That's not an alternative R^2, that's the standard R^2. The adjusted R^2 that takes into account the number of regressors is the alternative one.)


There is now a new definition for E^2 (one minus the ratio of the R^2 of the model and the "true" model) which doesn't solve the most obvious issue: getting negative values for a measure called "something squared". The values of E^2 in the first example are now -0.13 for the linear model and -0.30 for the quadratic model. In the second example, they are 0.24 and 0.10 respectively.

The graphical representation is a bit misleading. Leaving aside the fact that in the first example MSE_T is between MSE_M and MSE_C, this drawing makes E^2 and R^2 seem more complementary than they really are. E^2 is the length of the blue bar as a fraction of the total length (blue+orange). R^2, however, is the length of the orange bar as a fraction of the distance from the end of the bar to the origin (not shown in the chart).

Edit: there is a new addition to the post, re-expressing E^2 in terms of a mean/variance decomposition. It should be kept in mind that the derivation presented is only asymptotically correct. In a small sample, the cross term does not vanish and the variance of the observations around the "true" value is not exactly sigma^2. In the second example, E^2 calculated using this new definition is quite similar (0.2373 and 0.0991 for the linear and quadratic models, compared to the previous values of 0.2382 and 0.0994). In the first example, however, the values we get from the new definition are far from the previous values: 0.0646 vs -0.129 for the linear model, 0.1528 vs -0.297 for the quadratic model.

Edit2: changed "approximation" to "new definition", "good" to "similar" and "exact" to "previous" in the previous paragraph. I'm not sure if he was suggesting to use this formula to calculate E^2 instead of the previous one. Anyway, it doesn't matter because this is not something that can be calculated at all unless the "true model" is known.


I think that this example in particular is not the best for R^2. He's getting a really good fit for the linear model (especially when his first plot is centered in a narrow range), since log(x) has a nice Taylor expansion, log(x) ~ x - 1, in that region.

For fits that stay almost entirely close to the mean (no slope) I would expect the F-test to save us, but it doesn't here, since there's a region where a linear fit matches the data at least somewhat well.


Interesting article and I find it current for some problems I'm working on at the moment.

I would add a few challenges. The example is a bit of a strawman - a log(x) function has unique properties that make the Xmax-Xmin vs. R^2 behavior work like that. In real data, rarely does a single-variable 'true model' fit as well as the example either.

Context is needed as well - depending on the use of the model, a linear or quadratic fit may be sufficient even for what is clearly a log dataset. The real failure is only for small values of x, maybe 5% of the total range of values. For this case, a bilinear model could fit quite well for the lower 5%, with the existing model for the upper 95%. It depends on the application. I like this phrase:

"When deciding whether a model is useful, a high R2 can be undesirable and a low R2 can be desirable."

Too often statistics are dominated by 'cutoff' values that people apply blindly to all situations.

What do you think of robust regression methods, where obvious outliers are down-weighted?


I'm not the author, but I'm a huge fan of robust regression. I make between $500-2000/month off a trading strategy based on such a method. (The method is basically Bayesian linear regression, but using an error model that has a heavier tail than a Gaussian.)

But a really important thing when using such methods is the Lucas critique. When you need to use robust regression you are definitely in a space where all the simple and generic stuff (e.g. linear regression) doesn't work. So at this point it's important to validate the underlying assumptions behind the robust regression scheme.

E.g., in my trading strategy, I've gone to great lengths to make sure the tail behavior I'm modelling is an overestimate of reality.
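Not my exact setup, but a minimal sketch of the general idea in R (Student-t errors fit by maximum likelihood rather than a full Bayesian treatment; the degrees of freedom and the planted outliers are arbitrary):

  set.seed(3)
  x <- runif(100, 0, 10)
  y <- 1 + 2 * x + rnorm(100)
  y[1:5] <- y[1:5] + 30                 # a few gross outliers
  neg_loglik <- function(par) {         # linear regression with t(4) errors
    mu    <- par[1] + par[2] * x
    sigma <- exp(par[3])                # log-parameterized to keep sigma positive
    -sum(dt((y - mu) / sigma, df = 4, log = TRUE) - log(sigma))
  }
  robust <- optim(c(0, 1, 0), neg_loglik)
  robust$par[1:2]                       # intercept/slope largely unmoved by the outliers
  coef(lm(y ~ x))                       # ordinary least squares gets pulled around much more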


Could you share more details about the robust regression you are using? All resources I could find online on robust regression would either point to Laplace distributed residuals, or some capped loss function.


That seems like an advantage for R²: if R² goes down when you add more data, you know your model doesn't fit the data any more.


Has the author offered an alternative? Namely, can E^2 be calculated in practice?


I don't think you can calculate E^2 without the "true model", which you practically never have. The code uses the "true model" too: https://github.com/johnmyleswhite/r_squared/blob/master/util...

I guess the post is similar to Anscombe's quartet [1]: a warning not to blindly trust summary statistics.

[1] https://en.wikipedia.org/wiki/Anscombe%27s_quartet


So the solution the author proposes is both absolutely correct and absolutely useless in practice?


He's not proposing a solution as far as I can tell. He's simply using the E^2 statistic in order to illustrate the problem.


My main issue with R^2 is that it is an artificial indicator. It has no business meaning. The best alternative is to build a metric that is related to your business problem. How many dollars do you lose if you're wrong by 1%? Or by 10 units? Is your business loss linear in your error? Is it symmetric?

These are the kind of questions you have to ask yourself. Defining a metric is hard, and there is no good shortcut.
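For example (all costs invented), such a metric might look like:

  # over-forecasting leaves excess inventory, under-forecasting loses sales,
  # and the two mistakes rarely cost the same per unit
  business_cost <- function(actual, predicted, cost_over = 2, cost_under = 5) {
    err <- predicted - actual
    sum(ifelse(err > 0, cost_over * err, cost_under * (-err)))
  }
  business_cost(actual = c(100, 120), predicted = c(110, 100))   # 2*10 + 5*20 = 120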


The author says that E^2 is "generally unmeasurable". Indeed, the true model will be unknown in most cases.

In my opinion, it is not about using E^2 or another alternative measure to R^2, but about being aware of the significance and validity of R^2 and not blindly assuming that a higher value always means a better fit.


This is a weird critique of R-Squared. You (should) learn very early on that comparing R-Squared values of different models is a no-no.


And you're claiming it's weird because people learn this early on? Maybe that's where we disagree -- it's not as common to learn about the issues with R^2 as you might think.


Maybe the fact that this is in the (reasonably brief) Wikipedia entry on R^2 is some evidence?

https://en.wikipedia.org/wiki/Coefficient_of_determination#I...

Also, I find that the OP is actually more confusing on this topic than Wikipedia.

Not sure how else I can support that this is a basic fact about this metric. I'm not in the mood to find quotes in intro textbooks, etc.


Learn about information theory. It is better to calculate the cross-entropy error or KL-divergence between the data and the model.
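For a real-valued response modeled with a Gaussian predictive distribution, one concrete version of this is the average negative log predictive density (a sketch with simulated data; ideally you would score held-out data rather than the training set):

  x <- runif(200); y <- 2 * x + rnorm(200, sd = 0.3)
  fit   <- lm(y ~ x)
  sigma <- summary(fit)$sigma                                    # residual standard error
  -mean(dnorm(y, mean = fitted(fit), sd = sigma, log = TRUE))    # cross-entropy estimate, in nats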


Do you have any links where KL-divergence is used in the case of real-valued response variables? Why is it better?


This is not very accessible for people with weak statistical backgrounds.


R-Squared is the most commonly used indicator of how well a model fits some data. The author discusses why this indicator can be misleading in some cases and shows an example. There is nothing of interest here for people unfamiliar with R-Squared values or model fitting.

But in fact, it is not complicated. If you feel curious, feel free to ask any specific question you have.



