
I find this very confusing, but I guess I'm not the intended target audience. Not that I'm saying it's wrong, but I don't really see the point.

Do people really expect R^2 to measure the fit of the model to the true model? R^2 measures the fit of the model to the data: i.e. how well the model performs in predicting the outcomes. In his first example it is clear that all the models are equally useless: the noise dominates and the predictive power of the models is close to zero. In the second example the predictive power of all the models has improved, because there is a clear trend. The true model now predicts much better than the others, but each model predicts better than in the previous example.

In the first example, he concludes: "Even though R^2 suggests our model is not very good, E^2 tells us that our model is close to perfect over the range of x."

Actually our model is "better than perfect". The R^2 for the linear model (0.0073) and for the quadratic model (0.0084) is slightly better than for the true model (0.0064). Of course this is not a problem specific to the R^2 measure (the MSE for the linear and quadratic fits is lower than for the true generating function) and can be explained by the fact that the linear and quadratic models overfit. E^2 is essentially the ratio of the 1-R^2 values (minus one). We get -0.00083 and -0.00193 for the linear and quadratic models respectively (the ratios before subtracting one are 0.9992 and 0.9981).

In the second example, "visual inspection makes it clear that the linear model and quadratic models are both systematically inaccurate, but their values of R^2 have gone up substantially: R^2=0.760 for the linear model and R^2=0.997 for the true model. In contrast, E^2=85.582 for the linear model, indicating that this data set provides substantial evidence that the linear model is worse than the true model."

The R^2 already indicates that the linear model (R^2=0.760) and the quadratic model (R^2=0.898) are worse than the true model (R^2=0.997). The fractions of unexplained variance are 0.240, 0.102 and 0.003 respectively, and it's clear that the last one performs much better than the others before we take the ratios and subtract one to calculate the E^2 values 85.6 and 35.7 for the linear and quadratic models respectively.
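
To make the arithmetic concrete, a quick sketch (assuming E^2 = (1 - R^2_model)/(1 - R^2_true) - 1, which is my reading of the first definition; feeding in the rounded R^2 values quoted above makes the results differ a bit from the unrounded 85.6 and 35.7):

```python
# E^2 (first definition): ratio of unexplained-variance fractions, minus one.
# The R^2 inputs are the rounded values quoted above, so the results differ
# slightly from the post's figures, which use unrounded numbers.

def e2(r2_model, r2_true):
    """E^2 = (1 - R^2_model) / (1 - R^2_true) - 1."""
    return (1 - r2_model) / (1 - r2_true) - 1

r2_linear, r2_quad, r2_true = 0.760, 0.898, 0.997

# Fractions of unexplained variance: 0.240, 0.102, 0.003.
print(1 - r2_linear, 1 - r2_quad, 1 - r2_true)

print(e2(r2_linear, r2_true))  # ~79 with rounded inputs (post reports 85.6)
print(e2(r2_quad, r2_true))    # ~33 with rounded inputs (post reports 35.7)
```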

(By the way: "we’ll work with an alternative R^2 calculation that ignores corrections for the number of regressors in a model." That's not an alternative R^2, that's the standard R^2. The adjusted R^2 that takes into account the number of regressors is the alternative one.)




There is now a new definition for E^2 (one minus the ratio of the R^2 of the model to the R^2 of the "true" model) which doesn't solve the most obvious issue: getting negative values for a measure called "something squared". The values of E^2 in the first example are now -0.13 for the linear model and -0.30 for the quadratic model. In the second example, they are 0.24 and 0.10 respectively.
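
For comparison, the second definition works out like this (the formula is my reading of the updated post, so treat it as an assumption; the R^2 inputs are the rounded values quoted earlier):

```python
# E^2 (second definition): one minus the ratio of the models' R^2 values.
# Formula assumed from the description above; inputs are rounded R^2 values.

def e2_v2(r2_model, r2_true):
    """E^2 = 1 - R^2_model / R^2_true."""
    return 1 - r2_model / r2_true

# First example: negative values for a "squared" quantity.
print(e2_v2(0.0073, 0.0064))  # ~ -0.14 (post reports -0.13)
print(e2_v2(0.0084, 0.0064))  # ~ -0.31 (post reports -0.30)

# Second example: positive, close to the fractions of unexplained variance.
print(e2_v2(0.760, 0.997))  # ~ 0.24
print(e2_v2(0.898, 0.997))  # ~ 0.10
```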

The graphical representation is a bit misleading. Leaving aside the fact that in the first example MSE_T is between MSE_M and MSE_C, this drawing makes E^2 and R^2 seem more complementary than they really are. E^2 is the length of the blue bar as a fraction of the total length (blue+orange). R^2, however, is the length of the orange bar as a fraction of the distance from the end of the bar to the origin (not shown in the chart).

Edit: there is a new addition to the post, re-expressing E^2 in terms of a mean/variance decomposition. It should be kept in mind that the derivation presented is only asymptotically correct. In a small sample, the cross term does not vanish and the variance of the observations around the "true" value is not exactly sigma^2. In the second example, E^2 calculated using this new definition is quite similar (0.2373 and 0.0991 for the linear and quadratic models, compared to the previous values of 0.2382 and 0.0994). In the first example, however, the values we get from the new definition are far from the previous values: 0.0646 vs -0.129 for the linear model, 0.1528 vs -0.297 for the quadratic model.

Edit2: changed "approximation" to "new definition", "good" to "similar" and "exact" to "previous" in the previous paragraph. I'm not sure if he was suggesting to use this formula to calculate E^2 instead of the previous one. Anyway, it doesn't matter because this is not something that can be calculated at all unless the "true model" is known.


I think that this example in particular is not the best for R^2. He's getting a really good fit for linear (especially when his first plot is centered in a narrow range), since log(x) has a nice first-order Taylor expansion, log(x) ≈ x - 1, around x = 1.
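
A quick numerical check of how tight that approximation is (the interval [0.8, 1.2] is my made-up stand-in for a "narrow range" around x = 1):

```python
import math

# Near x = 1 the first-order Taylor expansion log(x) ~ x - 1 is very
# accurate, which is why a linear fit does so well on a narrow range.
xs = [0.8 + 0.01 * i for i in range(41)]  # grid on [0.8, 1.2]
max_err = max(abs(math.log(x) - (x - 1)) for x in xs)
print(max_err)  # worst-case gap between log(x) and its linear approximation
```

The worst-case error over that whole interval is only about 0.023 (attained at x = 0.8), tiny compared to the scatter in a noisy data set.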

For fits that stay almost entirely close to the mean (no slope) I would expect the F-test to save us, but it doesn't here, since there's a region where a linear fit matches the data at least somewhat well.
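
The overall-regression F-test alluded to here can be computed from R^2 alone with the standard formula; a sketch (the sample size n = 100 is a made-up number for illustration, not from the post):

```python
# Overall-significance F-test for a regression, computed from R^2.
# Standard formula: F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
# for k regressors and n data points. n = 100 is an assumed sample size.

def f_statistic(r2, n, k):
    """F statistic for the hypothesis that all k slope coefficients are zero."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# A flat fit (R^2 near zero, as in the first example) gives a tiny F,
# while even a systematically wrong linear fit on trending data
# (R^2 = 0.760, second example) gives a huge, clearly significant F.
print(f_statistic(0.0073, 100, 1))  # well below 1: not significant
print(f_statistic(0.760, 100, 1))   # in the hundreds: very significant
```

So the F-test screens out the no-slope case but, as the comment says, it can't flag a linear fit that tracks a real trend while being the wrong functional form.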



