
I've come back to this very page a number of times over the years and I've always been dissatisfied with how it glosses over why E_1 implies we should take the median, and E_2 implies we should take the mean.

Assume that we have some value of s_1, with elements x_lt \in {x_1, x_2, ..., x_k}, x_lt < s_1 and x_gt \in {x_{k+1}, x_{k+2}, ..., x_n}, x_gt > s_1 (ignoring the case where any element is equal to s_1 to make the reasoning easier). Moving s_1 to s_1 + epsilon increases E_1 by epsilon * |X_lt| and decreases it by epsilon * |X_gt|, which implies that E_1 is minimized when the number of elements of X on either side of s_1 is equal (now that I've written this out I think that's what 'ogogmad is getting at).
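
A quick brute-force check of this, just as a sketch (the data set here is an arbitrary example):

  import numpy as np

  X = np.array([0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6])

  # E_1(s) = sum of absolute deviations from a candidate value s
  def E1(s):
      return np.sum(np.abs(X - s))

  # brute-force search over a fine grid of candidate values
  grid = np.linspace(X.min(), X.max(), 601)
  best = grid[np.argmin([E1(s) for s in grid])]

  print("argmin of E_1 on the grid:", round(best, 2))  # 3.0
  print("median of X:", np.median(X))                  # 3.0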

I'm still trying to develop the intuition for why minimizing E_2 implies taking the mean.




The intuition isn't clear to me either, but the calculus is fairly simple. At the minimum, the derivative of the aggregate discrepancy is zero:

0 = d/ds sum (x_i - s)^2 = -2 sum(x_i - s) = -2 [(sum x_i) - n * s]

Thus

n*s = sum x_i

so

s = sum x_i / n
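
For what it's worth, here's a tiny symbolic check of that derivation (a sketch assuming sympy is available; the number of points is arbitrary):

  import sympy as sp

  n = 5  # any fixed number of points works for the illustration
  xs = sp.symbols(f"x1:{n + 1}")  # symbols x1, ..., x5
  s = sp.symbols("s")

  # E_2(s) = sum of squared discrepancies
  E2 = sum((x - s) ** 2 for x in xs)

  # set dE_2/ds = 0 and solve for s
  print(sp.solve(sp.diff(E2, s), s))  # [x1/5 + x2/5 + x3/5 + x4/5 + x5/5]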


I find turning the math into working code sometimes helps me. The for-loop in my code only approximates the mean, since it searches a grid of candidate values spaced 0.1 apart. This is because it relies on the definition for finding the mean provided in OP's article, rather than your derivation of the arithmetic mean.

  import numpy as np

  X = [0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6]

  # maps total squared discrepancy -> the candidate value that produced it
  mean_dict = {}

  # try candidate values of s on a grid from 0.0 to 9.9 in steps of 0.1
  for i in [x / 10 for x in range(0, 100)]:
      mean_dict[sum(abs(x - i) ** 2 for x in X)] = i

  print(mean_dict[min(mean_dict)], "score:", min(mean_dict))

  # add the actual mean to the candidates for comparison
  mean_dict[sum(abs(x - np.mean(X)) ** 2 for x in X)] = np.mean(X)

  print("real mean:", mean_dict[min(mean_dict)], "score:", min(mean_dict))


You can get proper code formatting by indenting two spaces.


On the median (L1 loss), there's lots of good information on the wiki page: https://en.wikipedia.org/wiki/Least_absolute_deviations


>I'm still trying to develop the intuition for why minimizing E_2 implies taking the mean.

Maybe geometric motivation helps.

Consider two vectors

X = [x1, x2, ..., xN],

S = [(x1 - s), (x2 - s), ... , (xN - s)]

How is finding the shortest Euclidean distance between X and S related to the mean?


I don't understand what you're trying to get at here. Surely the shortest Euclidean distance between X and S is obtained by setting s=0, which has nothing to do with the mean.


In nabla9's example, S would be the "vector discrepancy". If you want to minimize the discrepancy of s and X (NB. s and X, not S and X, I think this was a mistake), you want this vector to be as small as possible. In other words, you want the magnitude of the S vector to be minimal. This magnitude is given by the norm of the vector, sqrt(sum((S_i)^2)), which will reach a minimum when sum((S_i)^2) reaches a minimum.
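
To make that concrete, here's a sketch (arbitrary data; scipy's minimize_scalar is just one way to do the minimization): minimizing the magnitude of the discrepancy vector S over s lands on the mean of X.

  import numpy as np
  from scipy.optimize import minimize_scalar

  X = np.array([0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6])

  # magnitude of the discrepancy vector S(s) = (x_1 - s, ..., x_n - s)
  def norm_S(s):
      return np.linalg.norm(X - s)

  result = minimize_scalar(norm_S)
  print("argmin of ||S||:", result.x)  # ~3.5454
  print("mean of X:", X.mean())        # ~3.5454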


Ok, but how does that provide intuition for sum((S_i)^2) reaching a minimum at the mean of the x values?


You can take the derivative and set it to zero, and you'll get the definition of the mean.


Sure, but that's not intuition.


If you model x_i as a random sequence generated by a random variable:

    X ~ s + eps
where eps is normally distributed with mu=0, sigma>0. The probability density of the observations x_1, \dots, x_n is:

   p(x_1,\dots,x_n) = C * exp(-(s-x_1)^2/(2 sigma^2)) * exp(-(s-x_2)^2/(2 sigma^2)) * ... * exp(-(s-x_n)^2/(2 sigma^2))
Taking the log, you get

   Log(p(x_1,\dots,x_n)) = log(C) - (1/(2 sigma^2)) * sum_i (s-x_i)^2
So the probability density of your observation is maximized if the sum-of-squares is minimized.

This means if you set s to mean(x_i), you will maximize the "log-likelihood" of your observation.
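
A small numeric illustration of this, as a sketch (the data and sigma are arbitrary; any fixed sigma > 0 gives the same argmax):

  import numpy as np
  from scipy.stats import norm

  x = np.array([0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6])
  sigma = 1.5

  # Gaussian log-likelihood of the whole sample as a function of s
  def log_likelihood(s):
      return norm.logpdf(x, loc=s, scale=sigma).sum()

  grid = np.linspace(x.min(), x.max(), 601)
  best = grid[np.argmax([log_likelihood(s) for s in grid])]

  print("argmax of the log-likelihood:", round(best, 2))  # grid point nearest the mean
  print("mean:", x.mean())                                # 3.5454...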


This depends on the Normal distribution, so I don't think it's really very helpful for developing an intuition of the general case.


Here's a shot at explaining it in terms similar to your explanation for E_1.

-------------------

As above, we'll have some value of s_2 and a set of points x_1 < x_2 < ... < x_k < s_2 < x_{k+1} < ... < x_n. We define E_2 = \sum (x_i - s_2)^2.

Suppose we move s_2 slightly to the right, obtaining s_2 + epsilon. The first k terms of the sum will increase, while the last (n-k) terms will decrease. Calculus tells us that each term will change by -2 (x_i-s_2) epsilon + O(epsilon^2). If epsilon is small (infinitesimal) we can ignore the epsilon^2 term. As expected this quantity is negative for points to the right of s_2 and positive for points to the left of s_2. We can add these up to find the total change to be

-2 × epsilon × \sum (x_i-s_2),

dropping the higher order term.

If s_2 is too far to the left, moving it to the right will decrease E_2 as the decreases from points x_{k+1}, ... x_n will be larger than the increases from points x_1, ..., x_k. Similarly if s_2 is too far to the right, moving it to the left will decrease E_2. So where's the point where we can't gain anything by moving left or right?

If we look at our expression above,

-2 × epsilon × \sum (x_i-s_2)

as a function of s_2, it's negative when s_2 is small and positive when s_2 is large. Negative means we can do better by increasing s_2; positive means we can do better by decreasing s_2. It's just a straight line, so if it's negative on one side and positive on the other side there has to be some point where it's exactly zero. At that point we can't improve E_2 at all by moving in either direction. To find that point we simply set

-2 × epsilon × \sum (x_i-s_2) = 0

and then some simple algebra gives us

s_2 = \sum(x_i)/n
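
To make the sign argument concrete, here's a small sketch (arbitrary data) evaluating the marginal change -2 × \sum (x_i - s_2) per unit of epsilon:

  import numpy as np

  X = np.array([0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6])
  mean = X.mean()  # 3.5454...

  # marginal change of E_2 per unit of epsilon: -2 * sum(x_i - s_2)
  def marginal_change(s2):
      return -2 * np.sum(X - s2)

  for s2 in [mean - 1, mean, mean + 1]:
      print(s2, marginal_change(s2))
  # negative below the mean (moving right helps), ~0 at the mean,
  # positive above the mean (moving left helps)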

-----------------

There's also a purely algebraic way to see that E_2 is minimized at the mean, without requiring any calculus. Whether it's helpful for intuition or not I don't know.

Suppose we have some value s'_2 that we're interested in, and we think this might minimize E_2:

E'_2 = \sum(x_i - s'_2)^2

I'm going to prove to you that E'_2 is not smaller than \sum(x_i - m)^2, where m is the mean. We can write

E'_2 = \sum(x_i - m + m - s'_2)^2,

since adding and subtracting m in the middle doesn't change anything. Then split it up

E'_2 = \sum((x_i - m) + (m - s'_2))^2

and expand the square

E'_2 = \sum(x_i - m)^2 + \sum(m - s'_2)^2 + 2 × \sum[(x_i - m)(m - s'_2)]

The first term is the one that I said E'_2 was not smaller than. The second term is the sum of n identical squares, n × (m - s'_2)^2, so it is nonnegative. The third term, it turns out, is zero, because we can pull the (m - s'_2) factor out of the sum and then \sum(x_i - m) is zero. So we have

E'_2 = \sum(x_i - m)^2 + {nonnegative stuff},

which proves that s'_2 is not better than m. In fact the second term is strictly positive unless s'_2 = m, so m is the unique minimizer.
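
A quick numeric check of that identity, as a sketch with an arbitrary data set and candidate value:

  import numpy as np

  X = np.array([0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6])
  m = X.mean()
  s2_prime = 2.7  # any candidate value works

  E2_prime = np.sum((X - s2_prime) ** 2)
  decomposed = np.sum((X - m) ** 2) + len(X) * (m - s2_prime) ** 2

  # the two agree (up to floating-point rounding), and E2_prime can never
  # drop below np.sum((X - m) ** 2)
  print(E2_prime, decomposed)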


Side note: you can use this to convert LaTeX to Unicode for easier reading.

https://www.unicodeit.net/



