I don't like this article. It mentions, but glosses over, the real reason squared error is used: assuming a Gaussian error term, squared error yields the maximum likelihood estimator.
That is the core reason to use squares -- if the error is not normal, it just won't work. This is why, e.g., a small number of outliers can wildly skew a least-squares fit: if the error were normal, they simply wouldn't exist.
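Here is a rough numeric illustration of that outlier point (toy example of my own, not from the article; the data, seed, and outlier values are made up): two wild outliers are enough to drag a least-squares line well away from the trend that fits the other hundred points.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 100)
    y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)   # Gaussian noise: least squares is the MLE here
    slope_clean, _ = np.polyfit(x, y, 1)                     # np.polyfit returns [slope, intercept] for degree 1

    x_out = np.append(x, [9.5, 10.0])
    y_out = np.append(y, [-200.0, -250.0])                   # two gross outliers
    slope_skewed, _ = np.polyfit(x_out, y_out, 1)

    print("slope without outliers:", slope_clean)    # close to 3
    print("slope with two outliers:", slope_skewed)  # dragged far below 3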
I wanted to add a note on it being BLUE (the best linear unbiased estimator):
"The Gauss–Markov theorem states that in a linear regression model in which the errors have expectation zero and are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator."[1]
The benefit of a linear estimator is that it is a lot easier (both computationally and algorithmically) to fit/solve a linear model than a non-linear one, so there are real advantages to linear models. Further, despite being called linear, the model can actually be quite complex: it only needs to be linear in the coefficients; the regressors/independent variables can be nonlinear transformations of the data. Thus, you can make one regressor the square of another variable, or add a regressor that is the interaction of two other variables. By doing so, you can capture non-linearity while still using linear methods to solve the equation.
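A minimal sketch of that point (the coefficients, seed, and data below are made up for illustration): the design matrix includes a squared term and an interaction term, yet the fit is still plain linear least squares.

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=200)
    x2 = rng.normal(size=200)
    y = 2.0 + 1.5 * x1 - 0.5 * x1**2 + 3.0 * x1 * x2 + rng.normal(scale=0.1, size=200)

    # Design matrix: intercept, x1, x2, x1 squared, x1*x2 interaction -- still linear in the coefficients.
    X = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)  # recovers roughly [2, 1.5, 0, -0.5, 3]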
OK, so once you see the value of a linear model, the next question is why we would want the best and most unbiased. I think those are rather obvious.
> The next question is why we would want the best and most unbiased. I think those are rather obvious.
It's not obvious to me why we want an unbiased estimator actually. A Maximum-Likelihood estimate, for example, seems just as reasonable. Do you have an explanation?
Not just that, but there's also the question of why we're looking for point estimates in the first place. It isn't the correct thing to do, but no one ever gives an explanation.
Yes, but all the mathematical niceties don't matter if MMSE estimation is not good estimation in practice. That is the reason Gauss invented MMSE in the first place, along with the eponymous distribution, while working with astronomical observations, IIRC.
I'm not sure why you feel that is the "real" reason. It is a strong one, of course, but the fact that the squared error has a derivative - and a nice and simple one - is equally strong in my opinion.
Good point, that really is the key. However, it just turns it into the almost-equivalent problem of why to assume Gaussian error terms, and the rest of the article does a good job of motivating this.
Gaussian is sometimes a good approximation when the error is a sum of many independent thin-tailed variables (due to the CLT). That is all. The article does not, in my opinion, motivate this very well.
Nah. Least squares is necessarily optimal under a normal distribution, but that does not imply that it does not work if the distribution is not normal. It might even be optimal for some other distributions, I just cannot remember, and I have to watch the frying pan.
Least squares is disastrously bad for anything with a fat tail (e.g., power law decay). The reason is that in these cases one or two outliers will dominate the sum of the tails.
It's not optimal for any other distribution. For a general error distribution g(x), you want to maximize sum(log(g(x_i - x0))) (or equivalently prod(g(x_i - x0))) -- this only turns into least squares if log(g(x)) = C - D*x^2 with D > 0, i.e., if g is Gaussian.
I don't know of any references; most of this is just stuff that falls out pretty easily once you try to do the math, and that's probably faster than reading a book.
I.e., set up the maximum likelihood, take logs, and you immediately get least squares for g(x) a Gaussian. If g(x) = exp(-|x|), you get L1 minimization. Other distributions give you other things.
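A quick numerical check of this (sketch mine; the data and seed are arbitrary): maximizing the likelihood of a location parameter x0 is minimizing sum(-log g(x_i - x0)). For Gaussian g the minimizer is the sample mean (least squares); for g(x) proportional to exp(-|x|) it is the sample median (L1).

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(3)
    data = rng.normal(loc=5.0, scale=2.0, size=1001)
    lo, hi = data.min(), data.max()

    nll_gauss = lambda x0: np.sum((data - x0) ** 2)      # negative log-likelihood up to constants, Gaussian errors
    nll_laplace = lambda x0: np.sum(np.abs(data - x0))   # negative log-likelihood up to constants, Laplace errors

    print(minimize_scalar(nll_gauss, bounds=(lo, hi), method="bounded").x, data.mean())        # agree
    print(minimize_scalar(nll_laplace, bounds=(lo, hi), method="bounded").x, np.median(data))  # agree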
The article mentions the fairly familiar fact that "median is to L1 as mean is to L2" -- i.e., the mean is the point from which the total squared error is minimized, and the median is a point from which the total absolute error is minimized.
I was telling my 8-year-old daughter about that the other day, and it occurred to me that the other "measure of central tendency" children get taught about, the mode, also fits into this scheme, with a tiny bit of fudging. Define the L0 error to be the sum of the 0th powers of the errors, with the (unusual) convention that 0^0=0. In other words, the L0 error is just the number of data points you don't get exactly right. Then the mode is the value that minimizes the L0 error. So: L0 : L1 : L2 :: mode : median : mean.
(Note 1. The obvious other Lp norm to consider is with p=infinity, in which case you're minimizing the maximum error. That gives you the average of the min and max values in your dataset. Not useful all that often.)
(Note 2. What about the rest of the "L0 column" of Ben's table? Your measure of spread is the number of values not equal to the mode. Your implied probability distribution is an improper one where there's some nonzero probability of getting the modal value, and all other values are "equally unlikely". (I repeat: not actually possible; there is no uniform distribution on the whole real line.) Your regression technique is one a little like RANSAC, where you try to fit as many points exactly as you can, and all nonzero errors are equally bad. I doubt there's any analogue of PCA, but I haven't thought about it. Your regularized-regression technique is "best subset selection", where you simply penalize for the number of nonzero coefficients.)
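For anyone who wants to check the whole scheme numerically, here is a brute-force sketch (the data set and grid are made up; it uses the 0^0 = 0 convention for the L0 case and covers the p = infinity case from Note 1 as well):

    import numpy as np

    data = np.array([1, 2, 2, 2, 3, 7, 50])
    candidates = np.arange(0.0, 60.5, 0.5)      # coarse grid of candidate central values

    def lp_error(c, p):
        diffs = np.abs(data - c)
        if p == 0:
            return np.count_nonzero(diffs)      # number of points not hit exactly
        if p == np.inf:
            return diffs.max()                  # worst-case error
        return np.sum(diffs ** p)

    for p, name in [(0, "mode"), (1, "median"), (2, "mean"), (np.inf, "midrange")]:
        best = min(candidates, key=lambda c: lp_error(c, p))
        print(f"L{p} minimizer on this grid: {best}  ({name})")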
Define it as the limit, as n goes to 0 from above, of the sum of abs(actual - expected)^n, i.e., of the Ln error. No need to muck about with zero to the zeroth power directly.
The L0/L1 norm can have surprising benefits under certain assumptions. If you know (or assume) your signal is sparse (mostly zero) in some domain, you're trying to minimize the number of nonzero terms with the L0 norm... but that's hard, so we use the L1 norm instead (which still guarantees exact recovery even when sampling below the Nyquist rate). Look up the 2004 paper by Candes and Tao for more info (titled "Exact recovery ... undersampled ...").
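A rough illustration of the sparse-recovery idea (my own toy setup, not the paper's exact construction; the sizes, seed, penalty, and threshold are arbitrary choices): a length-200 signal with only five nonzero entries, observed through 60 random linear measurements, is recovered to good accuracy by an L1-penalized fit even though the linear system is badly underdetermined.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(7)
    n, m, k = 200, 60, 5                          # signal length, number of measurements, sparsity
    signal = np.zeros(n)
    signal[rng.choice(n, size=k, replace=False)] = rng.normal(scale=5.0, size=k)

    A = rng.normal(size=(m, n)) / np.sqrt(m)      # random sensing matrix
    y = A @ signal                                # noiseless measurements

    fit = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100000).fit(A, y)
    print("true support:     ", np.flatnonzero(signal))
    print("recovered support:", np.flatnonzero(np.abs(fit.coef_) > 0.1))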
Wow, despite studying probability for years I never quite realised that the mean minimises the squared (L2) error. This is despite knowing full well the essentially equivalent fact that the mean is the orthogonal projection in L2!
The median minimising L1 is also very nice, and I did not know that. It also means that the concept of median generalises easily to higher dimensions.
Laplace and Gauss were in mad competition to find a general rule for the pattern of errors for measurements in astronomy. It's around 1800, and the medium of exchange is letters.
Laplace has recently shown that the integral of exp(-t^2), where t goes from minus to plus infinity, has a definite value (the square root of pi).
Gauss starts a paper, basically says, let's just run with the arithmetic average, the usual mean, as the go-to best estimate, from a big series of measurements of a single thing, each of which has errors, because, yeah, why not.
Moments later the guy shows that squaring the errors makes it equivalent to that thing Laplace did, and we now have a general rule. His argument follows by supposing each repeated error due to a specific cause has an expected error of zero, with no particular rule of distribution, but that as the number of specific causes goes to infinity, the Law of large numbers applies (go Euler, fair play). Hence, wiggley wiggley, Laplace's thing.
Beautiful piece of work, and resting on Laplace and Euler too, two other heroic oddballs.
The best thing is the underlying explanation: take a nearly-infinite number of different measurements of a thing, each measurement with its own pattern of rubbishness, and the average of those measurements will approach having this distribution. That bit is what makes it so broadly applicable.
This is a terrific explanation for the uninitiated - covers pretty much all the reasons I would think of.
The author mostly describes nice mathematical properties, so I'd recommend also considering absolute deviation in practical analysis in many cases. Then your error is dominated less by a few outliers in your model. Specifically, least squares is best when you expect your error to be Gaussian-distributed, which isn't always the case.
You might say the example is contrived, but if the outliers were gone, both estimates would give nearly the same result. The main time to prefer L2 would be when your error is Gaussian and your variance is huge, but that's going to be an uphill battle regardless.
Thanks for the kind words! I actually agree that L1 error is often more practical, and didn't mean the post to come off as prescriptive. (In fact, part of my motivation for writing it was that L1 error seemed to have much better practical properties, and I was curious why people used L2 despite this!)
Well, that's not quite accurate -- Taleb doesn't argue that we should do away with squared error; he says we should do away with the standard deviation in favour of the mean absolute deviation from the mean.
(Hrm, I'm going back and forth on whether there's a substantive difference between the concepts of variance/standard deviation as measures of a probability distribution and squared/RMS error measures... Obviously they coincide when the "prediction" is the mean of the data, but I don't think that's terribly convincing. I think it'd be perfectly reasonable to use squared error in your model fit and parametrise your Gaussian distributions by their average absolute deviation from the mean.)
When talking about deviation, you're describing the spread of a distribution. It's perfectly fine to describe your data using mean (or median) absolute deviation, but then fit a model on top of it that minimizes the sum of squared errors. Taleb's point is that squared deviations make for a lousy descriptive statistic: the standard deviation is dominated by outliers and so misleads you about the typical distance of observations from the mean.
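A small numeric illustration of that descriptive point (example mine; the distribution, seed, and "10 most extreme points" cutoff are arbitrary): for a fat-tailed sample, the standard deviation is driven by a handful of extreme observations, while the mean absolute deviation stays close to the typical distance from the mean.

    import numpy as np

    rng = np.random.default_rng(11)
    sample = rng.standard_t(df=2, size=100_000)   # Student t with 2 degrees of freedom: fat tails

    def mad(x):
        return np.mean(np.abs(x - x.mean()))      # mean absolute deviation from the mean

    print("full sample:    SD =", sample.std(), " MAD =", mad(sample))

    # Drop the 10 most extreme points: the SD changes a lot, the MAD barely moves.
    keep = np.argsort(np.abs(sample))[:-10]
    print("10 points less: SD =", sample[keep].std(), " MAD =", mad(sample[keep]))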
With meager assumptions and a standard setup, we are in a vector space and want a notion of distance, that is, a vector space norm. The three norms people think of first are L1, L2, and L-infinity: L1 is from absolute values, L2 is from squares, and L-infinity is from the absolute value of the largest component (i.e., the worst case). But in addition it would be nice if the vector space with a norm were also an inner product space, with the norm coming from the inner product. Then, right, bingo, presto, for the standard inner product the norm we get is L2.
Why an inner product space? Well, with a lot of generality and meager assumptions, we have a Hilbert space, that is, a complete inner product space. The core of the proof of completeness is just the Minkowski inequality. Being in a Hilbert space has a lot of advantages: e.g., we get orthogonality and can take projections and, thus, get as close as possible in our L2 norm. We like projections, e.g., the Pythagorean theorem. E.g., in regression in statistics, we like that the total sum of squares is the sum of the regression sum of squares and the error sum of squares -- right, the Pythagorean theorem. We have some nice separation results. We can use Fourier theory. And there's more. And there are some good convergence results: if we converge in L2, then we also converge in other good ways.
One reason for liking a Hilbert space is that the L2 real-valued random variables form a Hilbert space, and there convergence in L2 gives almost sure convergence (the best kind) of at least a subsequence and, often in practice, of the whole sequence. So we connect nicely with measure theory.
We have some representation results: a continuous linear functional on a Hilbert space is just the inner product with some fixed point in the space. We like linear functionals and like knowing that on a Hilbert space they are so simple. Working with L1 and L-infinity is generally much less pleasant. That is, we really like Hilbert space. Net, we rush to a Hilbert space and its L2 norm from its inner product whenever we can. You get to do projections as in the Pythagorean theorem. The coefficients you need in the projections are just the values of some inner products. With random variables, those coefficients are covariances, that is, much the same as correlations, which we can commonly estimate from data. In the multivariate Gaussian case, uncorrelated implies independence.
Fourier theory is easier in L2 than in L1. E.g., in classic Fourier series, the error in the approximation is measured in L2 and comes from the L2 orthogonality of the harmonics. Yes, L-infinity can also be nice: the uniform limit of a sequence of continuous functions is continuous. With L2 we often get a Hilbert space, but with L1 or L-infinity we usually get at best just a Banach space -- that is, a complete normed vector space. Then, yes, we can get the Hahn-Banach theorem, but the same thing in Hilbert space is easier. There is a sense in which L1 and L-infinity are duals of each other, but L2 is self-dual, which is nicer.
Filling in all these details and more is part of functional analysis 101. There it's tough to miss at least three books by W. Rudin: Principles of Mathematical Analysis, Real and Complex Analysis, and Functional Analysis. There's more, but I've got some bugs to get out of the software of my Web pages!
I like the question -- asked it myself at NIST early in my career. The answer I gave here is better than what people told me then. I've indicated likely most of the main points, but my answer here is rough and ready (I typed too fast), and a quite polished answer is also possible -- I just don't have time today to dig out my grad school course notes, scan through Rudin, Dunford and Schwartz, Kolmogorov and Fomin, much of digital filtering, much of multivariate statistics, etc.
I intuit that you're getting at the real answer with the self-dual. That makes a lot of sense. Also, from a practical perspective L2 is very nice because it causes the problem of error reduction to be quadratic, so it scales well.
No: if what you want is the L-infinity norm, then go for it. A standard place for that is numerical approximation of special functions, where you want guarantees on the worst-case error, and there is some math to help achieve that. It's sometimes called Chebyshev approximation.
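As a concrete, hedged example of that worst-case flavor (my own toy comparison, not a full minimax/Remez computation; the function, degree, and grids are arbitrary): interpolating at Chebyshev nodes is known to be near-minimax, so its maximum error is far smaller than that of interpolation at equally spaced nodes for a function like Runge's 1/(1 + 25x^2).

    import numpy as np

    def max_interp_error(nodes, f, deg):
        coeffs = np.polyfit(nodes, f(nodes), deg)        # polynomial through the given nodes
        xs = np.linspace(-1.0, 1.0, 10001)               # dense grid to estimate the worst-case error
        return np.max(np.abs(np.polyval(coeffs, xs) - f(xs)))

    runge = lambda x: 1.0 / (1.0 + 25.0 * x**2)
    deg = 10
    cheb_nodes = np.cos((2 * np.arange(deg + 1) + 1) * np.pi / (2 * (deg + 1)))  # Chebyshev nodes
    equi_nodes = np.linspace(-1.0, 1.0, deg + 1)                                 # equally spaced nodes

    print("max error, Chebyshev nodes: ", max_interp_error(cheb_nodes, runge, deg))
    print("max error, equispaced nodes:", max_interp_error(equi_nodes, runge, deg))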
But in practice, in the usual situations, e.g., signal processing and multivariate statistics, there's no good reason not to use L2 and many biggie reasons to use it. E.g., for a given box of data, the better tools in L2 commonly just let you do better. Or, to the customer: "If you will go for a good L2 approximation, then we are in good shape. If you insist on L1 or L-infinity, then we will need a lot more data and still won't do as well." Again, a biggie example is just classic Fourier series. Sure, if you are really concerned about the Gibbs phenomenon, then maybe work on that. Otherwise, L2 is the place to be.
E.g., L1 and L-infinity can commonly take you into linear programming. Generally you will be much happier with the tools available to you in L2.
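To make that linear programming remark concrete (the formulation is the standard one; the data, seed, and sizes below are made up by me): least absolute deviations (L1) regression can be written as an LP by adding one slack variable per residual -- minimize sum_i t_i subject to -t_i <= y_i - x_i.b <= t_i.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(2)
    n, p = 50, 2
    X = np.column_stack([np.ones(n), rng.normal(size=n)])     # intercept plus one regressor
    y = X @ np.array([1.0, 2.0]) + rng.laplace(scale=0.5, size=n)

    # Variables: [beta (p entries, free), t (n entries, >= 0)]; objective: sum of the t_i.
    c = np.concatenate([np.zeros(p), np.ones(n)])
    A_ub = np.block([[ X, -np.eye(n)],
                     [-X, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * p + [(0, None)] * n

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    print("L1 (least absolute deviations) coefficients:", res.x[:p])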
Again, just now I don't have time for a fuller, more complete, and polished explanation. A really good explanation would require much of a good undergrad and Master's in math, with a concentration on analysis and a wide range of applications. I've been there, done that, but I just don't have time to write out even a good summary of all that material here.
I've struggled with the motivation for this too, and over time I've come to the conclusion that there is a much better 1-sentence explanation:
Cost (in the real world) is generally a quadratic quantity.
If this isn't obvious, think of all the formulas you've seen for work (or energy), which is the ultimate notion of "cost" in the real world: W = mv^2/2, W = CV^2/2, W = ɛE^2/2, etc.
Furthermore, the nice thing about energy (a frequent notion of cost) is that it is independent of the basis you measure it in. Put another way, since energy is generally the squared length of a vector, its value won't change no matter how you look at the vector.
Obviously, we're not always trying to minimize work or energy, and our variables of interest aren't always the linear counterparts, but these are true surprisingly often, and so this is a nice motivation for defining cost to be quadratic.
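A tiny check of that basis-independence claim (example mine; the dimension and seed are arbitrary): the squared length of a vector is unchanged by an orthogonal change of basis, while the sum of absolute values of its coordinates generally is not.

    import numpy as np

    rng = np.random.default_rng(9)
    v = rng.normal(size=3)

    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthogonal matrix (rotation/reflection)
    w = Q @ v                                      # the same vector expressed in the rotated basis

    print(np.sum(v**2), np.sum(w**2))              # equal: quadratic "energy" does not depend on the basis
    print(np.sum(np.abs(v)), np.sum(np.abs(w)))    # generally different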
Squaring the error makes sense to me: an approximation where a sample is 2 units off is 4 times worse than one where a sample is 1 unit off. Building your approximation to minimize the square of the error makes more intuitive sense to me.
This is a great article! It's always surprised me that, in machine learning and statistics, we take an objective function like squared error as a "given" rather than trying to design an objective function that's tailored to what we really care about.
As the article notes, we can often use more sophisticated optimization techniques to make use of objective functions like absolute error which are less "nice" mathematically, but closer to our end goal.
There's a related idea in speech recognition that goes under the name "discriminative training." The idea is that first you train a model using a standard fast objective function like maximum likelihood to get a good "seed" model. Then, you can retrain using a more expensive objective function (maximum mutual information) which corresponds to what you care about--whether the speech recognizer got the whole sentence right. Since that function is discontinuous, MMI represents a smooth version of it with a "randomized" decoder.
The squared error is indeed a proper scoring rule; it is, however, important to note that it is not strictly proper, meaning that two different probabilistic forecasts with the same mean will score the same. Outside the linear context, this is often not what you want, and you should look into alternatives like the logarithmic scoring rule.
Perhaps you are thinking of probabilistic regression problems, for which mean squared error is indeed not strictly proper. My favorite scoring rule for continuous variables is the continuous ranked probability score, which is actually a generalization of mean absolute error (for non-probabilistic regression).
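A small illustration of the strict-propriety point (toy example mine; the forecast means, spreads, and seed are made up): two Gaussian forecasts share the same mean but have different spreads. Scoring only the mean with squared error cannot tell them apart; the logarithmic score (negative log predictive density) prefers the well-calibrated one.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(13)
    y = rng.normal(loc=0.0, scale=1.0, size=10_000)        # observations; the true spread is 1

    for sigma in (1.0, 3.0):                               # well-calibrated vs over-dispersed forecast
        mse = np.mean((y - 0.0) ** 2)                      # identical for both: only the forecast mean enters
        log_score = -np.mean(norm.logpdf(y, loc=0.0, scale=sigma))
        print(f"sigma={sigma}: squared error of the mean = {mse:.3f}, log score = {log_score:.3f}")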
A nice and interesting article, but describing the squared error as "merely convenient" is a bit misleading, given that most noise is Gaussian, due to the central limit theorem.
Nah. It's not like that theorem implies that we must square errors, based only on axioms which are facts of the natural world.
On the contrary, squared errors are part of the formulas taken as axioms from which to derive the theorem.
I wouldn't be surprised if we could derive a theorem whose essence is tantamount to the essence of the famous Central Limit Theorem, using some other arbitrary formula. Like error to the fourth power.
My point was that many things are normally distributed (likely due to the CLT) and the square error is the right choice in those cases. This is an empirical thing as well, in that measurement noise is often Gaussian.
The really cool thing about a lot of machine learning algorithms is their synchronicity. There are usually multiple, independent rationales for choosing or developing a particular algorithm that end up saying very deep things about information theory.
For instance, actuaries, statisticians, and "neural network guys" (a combination of computer scientists, applied mathematicians, and biologists before it really concretized into a discipline) all independently invented logistic regression (and within each discipline, it usually got invented in a couple of different contexts before being unified within the same framework). They were all "trying" to do different things in terms of how they thought about the problem, but they ended up with the same model structure.
Thanks for the article; I wish I had read it before competing in a Kaggle contest where the score was based on absolute deviation. I ended up finding quantile regression and its relationship to the median eventually, but this article connects all the ideas very nicely.
The author is right: squared error isn't completely natural. However, one great property is that it gets expected values right (so using it helps you develop unbiased estimators). I have some notes on this here: http://www.win-vector.com/blog/2014/01/use-standard-deviatio...
Man, that MathJAX (I assume) is ugly without JavaScript. '\((P_{predicted} - P_{actual})\)' instead of 'P_predicted - P_actual,' which would look fine. Heck, with <sub> one could even Do the Right Thing in HTML itself.