Modes, Medians and Means: A Unifying Perspective (2013) (johnmyleswhite.com)
141 points by niklasbuschmann on July 11, 2020 | 39 comments



One can generalize this approach further to take any given loss function, not just a p-norm, and ask which statistic is obtained by minimizing the expected loss.

A field that studies this is called property elicitation, from the idea that the loss function elicits a certain kind of prediction, be it the mean, the mode, etc.

This field actually originated with human experts like weather forecasters: what loss function (or "scoring rule") incentivizes them to predict accurately? Squared loss is good if by "accurately" we mean we want the mean.

One fact people find surprising is that squared loss is not the only one that elicits the mean. There is an entire family of losses that do, including KL divergence.
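
A quick illustration of the idea (the toy sample and brute-force grid search are mine, not from the article): minimizing 0-1 loss, absolute loss and squared loss over the same data picks out the mode, median and mean respectively.

    import numpy as np

    x = np.array([1, 1, 1, 2, 3, 4, 9])       # skewed toy sample
    grid = np.arange(0.0, 10.001, 0.25)       # candidate predictions

    def best(loss):
        # total loss of each candidate prediction against the sample
        totals = [loss(x, s).sum() for s in grid]
        return grid[int(np.argmin(totals))]

    print(best(lambda x, s: (x != s)))         # 0-1 loss      -> mode   (1.0)
    print(best(lambda x, s: np.abs(x - s)))    # absolute loss -> median (2.0)
    print(best(lambda x, s: (x - s) ** 2))     # squared loss  -> mean   (3.0)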


This approach is also used to extend concepts like the median into higher dimensions and non-Euclidean spaces.

What is the "median" or "mean" of a set of points located on the surface of a sphere, or a torus, or some more complicated manifold (which may only be defined implicitly by a distance function)? If you try to start from the one-dimensional definition it's not really clear. You can define such concepts, however, by minimizing the corresponding discrepancies. If you minimize the sum of distances then you get something like a median. If you minimize the sum of squared distances then you get something like a mean (the Fréchet mean). Once you have something like a mean you can also define something like a variance.
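
A minimal grid-search sketch of this on the simplest curved space, the circle (the angles and the brute-force search are my own illustration; a real implementation would use a proper manifold optimizer):

    import numpy as np

    # a few points on the unit circle, given as angles in radians
    angles = np.array([0.1, 0.3, 0.4, 6.0])   # 6.0 sits "just below" 0 going the other way

    def geodesic(a, b):
        # shortest arc length between two angles on the unit circle
        d = np.abs(a - b) % (2 * np.pi)
        return np.minimum(d, 2 * np.pi - d)

    grid = np.linspace(0, 2 * np.pi, 10000)
    frechet_mean = grid[np.argmin([np.sum(geodesic(angles, g) ** 2) for g in grid])]

    print(frechet_mean)       # ~0.13, i.e. inside the cluster of points
    print(np.mean(angles))    # 1.7, the naive arithmetic mean ignores the wrap-around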

Wikipedia has a short article about the concept [1]. It really deserves more treatment, as it's a central (no pun intended) idea in topological data analysis [2].

[1] https://en.wikipedia.org/wiki/Fr%C3%A9chet_mean

[2] https://en.wikipedia.org/wiki/Topological_data_analysis


Right! A couple more interesting (IMO) examples of this:

1. The geometric mean is the minimizer of:

L(x) = Σ [ log(x_i) - log(x) ]²

2. The so-called Jaccard median is the minimizer of:

L(x) = Σ [ 1-J(x_i, x) ]

where J is the Jaccard overlap score (aka IoU, Intersection over Union). The second is interesting because the Jaccard score is often used to describe how close two segmentations are in image processing. The Jaccard median can be used to consolidate multiple segmentations of the same object in a way that is optimal under the Jaccard distance.
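
A quick numerical check of the first example (the toy data and the scipy-based minimization are my own choices): the minimizer of the log-squared loss matches the usual closed-form geometric mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.array([1.0, 2.0, 4.0, 8.0])           # toy positive data

    # L(s) = sum_i (log x_i - log s)^2
    loss = lambda s: np.sum((np.log(x) - np.log(s)) ** 2)

    res = minimize_scalar(loss, bounds=(1e-6, 100.0), method="bounded")
    print(res.x)                                  # numerical minimizer, ~2.83
    print(np.exp(np.mean(np.log(x))))             # geometric mean, 64**0.25 = 2.83...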


I've come back to this very page a number of times over the years and I've always been dissatisfied with how it glosses over why E_1 implies we should take the median, and E_2 implies we should take the mean.

Assume that we have some value of s_1, with elements x_lt \in {x_1, x_2, ... x_k}, x_lt < s_1, and x_gt \in {x_{k+1}, x_{k+2}, ... x_n}, x_gt > s_1 (ignoring the case where any element is equal to s_1 to make the reasoning easier). If we move s_1 to s_1 + epsilon, E_1 increases by epsilon * |X_lt| and decreases by epsilon * |X_gt|. This implies that E_1 is minimized when the number of elements of X on either side of s_1 is equal (now that I've written this out I think that's what 'ogogmad is getting at).

I'm still trying to develop the intuition for why minimizing E_2 implies taking the mean.


The intuition isn't clear to me either, but the calculus is fairly simple. At the minimum, the derivative of the aggregate discrepancy is zero:

0 = d/ds sum (x_i - s)^2 = -2 sum(x_i - s) = -2 [(sum x_i) - n * s]

Thus

n*s = sum x_i

so

s = sum x_i / n


I find turning the math into working code sometimes helps me. The for-loop in my code only approximates the mean to within its 0.1 grid spacing. This is because it relies on the brute-force definition for finding the mean provided in OP's article, rather than your derivation of the arithmetic mean.

    import numpy as np

    X = [0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6]

    # map each candidate's total squared discrepancy ("score") to the candidate
    mean_dict = {}

    # brute-force search over candidates from 0.0 to 9.9 in steps of 0.1
    for i in [x / 10 for x in range(0, 100)]:
        mean_dict[sum(abs(x - i) ** 2 for x in X)] = i
    print(mean_dict[min(mean_dict)], "score:", min(mean_dict))

    # add the actual mean to the candidates
    mean_dict[sum(abs(x - np.mean(X)) ** 2 for x in X)] = np.mean(X)
    print("real mean:", mean_dict[min(mean_dict)], "score:", min(mean_dict))


You can get proper code formatting by indenting two spaces.


On the median (L1 loss), lots of good information in the wiki page, https://en.wikipedia.org/wiki/Least_absolute_deviations


>I'm still trying to develop the intuition for why minimizing E_2 implies taking the mean.

Maybe geometric motivation helps.

Consider two vectors

X = [x1, x2, ..., xN],

S = [(x1 - s), (x2 - s), ... , (xN - s)]

How is finding the shortest Euclidean distance between X and S related to the mean?


I don't understand what you're trying to get at here. Surely the shortest Euclidean distance between X and S is obtained by setting s=0, which has nothing to do with the mean.


In nabla9's example, S would be the "vector discrepancy". If you want to minimize the discrepancy of s and X (NB. s and X, not S and X, I think this was a mistake), you want this vector to be as small as possible. In other words, you want the magnitude of the S vector to be minimal. This magnitude is given by the norm of the vector, sqrt(sum((S_i)^2)), which will reach a minimum when sum((S_i)^2) reaches a minimum.
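
One way to make that geometric picture concrete (my framing, not the parent's): minimizing the magnitude of the discrepancy vector S over s is an orthogonal projection of X onto the line spanned by the all-ones vector, and the projection coefficient is exactly the mean.

    import numpy as np

    X = np.array([2.0, 4.0, 9.0])
    ones = np.ones_like(X)

    # projecting X onto span{ones}: coefficient = <X, ones> / <ones, ones> = mean(X)
    s = X.dot(ones) / ones.dot(ones)
    print(s, np.mean(X))              # both 5.0

    # the residual S = X - s*ones is orthogonal to the ones direction
    S = X - s * ones
    print(np.dot(S, ones))            # 0.0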


Ok, but how does that provide intuition for sum((S_i)^2) reaching a minimum at the mean of the x values?


You can take the derivative, set it to zero, and you will get the definition of the mean.


Sure, but that's not intuition.


If you model x_i as a random sequence generated by a random variable:

    X ~ s + eps

where eps is normally distributed with mu = 0, sigma > 0, then the probability density of the observations x_1, \dots, x_n is:

    p(x_1,\dots,x_n) = C * exp(-(s-x_1)^2/(2 sigma^2)) * exp(-(s-x_2)^2/(2 sigma^2)) * ... * exp(-(s-x_n)^2/(2 sigma^2))

Taking the log you get

    log p(x_1,\dots,x_n) = C'' - C' * sum_i (s-x_i)^2

with C' > 0, so the probability density of your observations is maximized exactly when the sum of squares is minimized.

This means that if you set s to mean(x_i), you maximize the "log-likelihood" of your observations.
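
A quick numerical check of that argument (the simulated sample, the fixed sigma = 1 and the grid are my own choices): the grid point that maximizes the Gaussian log-likelihood coincides with the sample mean, up to the grid resolution.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = rng.normal(loc=3.0, scale=1.0, size=50)    # simulated sample with true s = 3

    grid = np.linspace(2.0, 4.0, 2001)
    loglik = [np.sum(norm.logpdf(x, loc=s, scale=1.0)) for s in grid]

    print(grid[np.argmax(loglik)], x.mean())       # nearly identical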


This depends on the Normal distribution, so I don't think it's really very helpful for developing an intuition of the general case.


Here's a shot at explaining it in terms similar to your explanation for E_1.

-------------------

As above we'll have some value of s_2, and a set of points x_1 < x_2 < ... x_k < s_2 < x_{k+1} < ... < x_n. We define E_2 = \sum (x_i - s_2)^2.

Suppose we move s_2 slightly to the right, obtaining s_2 + epsilon. The first k terms of the sum will increase, while the last (n-k) terms will decrease. Calculus tells us that each term will change by -2 (x_i-s_2) epsilon + O(epsilon^2). If epsilon is small (infinitesimal) we can ignore the epsilon^2 term. As expected this quantity is negative for points to the right of s_2 and positive for points to the left of s_2. We can add these up to find the total change to be

-2 × epsilon × \sum (x_i-s_2),

dropping the higher order term.

If s_2 is too far to the left, moving it to the right will decrease E_2 as the decreases from points x_{k+1}, ... x_n will be larger than the increases from points x_1, ..., x_k. Similarly if s_2 is too far to the right, moving it to the left will decrease E_2. So where's the point where we can't gain anything by moving left or right?

If we look at our expression above,

-2 × epsilon × \sum (x_i-s_2)

as a function of s_2, it's negative when s_2 is small and positive when s_2 is large. Negative means we can do better by increasing s_2; positive means we can do better by decreasing s_2. It's just a straight line, so if it's negative on one side and positive on the other side there has to be some point where it's exactly zero. At that point we can't improve E_2 at all by moving in either direction. To find that point we simply set

-2 × epsilon × \sum (x_i-s_2) = 0

and then some simple algebra gives us

s_2 = \sum(x_i)/n

-----------------

There's also a purely algebraic way to see that E_2 is minimized at the mean, without requiring any calculus. Whether it's helpful for intuition or not I don't know.

Suppose we have some value s'_2 that we're interested in, and we think this might minimize E_2:

E'_2 = \sum(x_i - s'_2)^2

I'm going to prove to you that E'_2 is not smaller than \sum(x_i - m)^2, where m is the mean. We can write

E'_2 = \sum(x_i - m + m - s'_2)^2,

since adding and subtracting m in the middle doesn't change anything. Then split it up

E'_2 = \sum((x_i - m) + (m - s'_2))^2

and expand the square

E'_2 = \sum(x_i - m)^2 + \sum(m - s'_2)^2 + 2 × \sum[(x_i - m)(m - s'_2)]

The first term is the one that I said E'_2 was not smaller than. The second term is the sum of a bunch of [identical] squares, so it is nonnegative. The third term, it turns out, is zero, because we can pull the (m - s'_2) factor out of the sum and then \sum(x_i - m) is zero. So we have

E'_2 = \sum(x_i - m)^2 + {nonnegative stuff},

which proves that s'_2 is not better than m. In fact the second term is strictly positive unless s'_2 = m, so m is the unique minimizer.
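
That algebraic identity, E'_2 = \sum(x_i - m)^2 + n(m - s'_2)^2, is also easy to check numerically (the toy data is mine):

    import numpy as np

    x = np.array([1.0, 4.0, 4.0, 7.0, 10.0])
    m = x.mean()                                       # 5.2

    for s in [0.0, m, 6.5]:
        lhs = np.sum((x - s) ** 2)
        rhs = np.sum((x - m) ** 2) + len(x) * (m - s) ** 2
        print(s, lhs, rhs)                             # lhs == rhs; both smallest at s == m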


Side note: you can use this to convert LaTeX to Unicode for easier reading.

https://www.unicodeit.net/


Interestingly, since the L^1 discrepancy is convex, it's possible to analytically minimize it by using the subdifferential, and thereby prove that its minimum is attained at the median. The subdifferential is a set-valued generalisation of the derivative to all convex functions, including non-differentiable ones. See https://en.wikipedia.org/wiki/Subderivative and https://towardsdatascience.com/beyond-the-derivative-subderi...

Whenever the subdifferential of a convex function at a point includes 0, that point is a global minimum of the function.


L^1's derivative is a perfectly good function; it's just not defined (or continuous) at 0, but whatever... it's for the same reason that the median handles even-sized sets in a special way.

0 = d/ds \sum_i | x_i - s | = - \sum_i sign(x_i - s)

which is true precisely when x_i <= s for half the elements and x_i >= s for the other half (taking sign(0) = 0, sign(+x) = 1, sign(-x) = -1).
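
A tiny sketch of that sign condition (the toy data is mine): the sum of signs crosses zero exactly at the median.

    import numpy as np

    x = np.array([1.0, 2.0, 7.0, 8.0, 9.0])       # median is 7

    for s in [2.0, 5.0, 7.0, 8.5]:
        # the sum of signs is zero only when s is the median
        print(s, np.sum(np.sign(x - s)))          # 2, 1, 0, -3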


A derivative can also be obtained for the abs function interpreted as a distribution / generalized function. However, I don't think it is helpful for the calculus because of the Dirac measures that pop up.


"Whenever the subderivative of a convex function at a point includes 0, that point is a global minimum of the function"

* you mean a strictly convex function


In the non-strictly convex case, the minimum is still a global minimum; it's just not unique.


[edit] right, you're correct sorry


the wikipedia version for anyone interested: https://en.wikipedia.org/wiki/Central_tendency#Solutions_to_...

missing from the article is that using L_inf will give you the midrange


Thanks for sharing, this is a great way to extend and generalise OP's post. I don't think I would have been able to understand the Wikipedia argument without the more intuitive explanation in the original post.

As an aside, I wonder if math is sometimes difficult to understand/explain because the language is so constrained. Obviously, careful use of words is important for translating mathematical definitions verbatim, but I think a more relaxed approach to wording math equations would benefit everyone below second-year math/statistics at university, even if they are left with a technically incomplete/incorrect definition or intuition for a given concept.


I like Mr. White's writing. I interacted with him a bit several years ago on Julia's dataframe packages. He is a brilliant guy.

I think this insight (measures of centrality) and an article posted to HN 4-5 years back framing quantum theory as a separate branch of statistics (statistics under a different Lp norm) really helped tie together a lot of quantitative threads for me.


Do you have a link to the article? It sounds very interesting.


Not the person to whom you were speaking but their description reminds me of Scott Aaronson's work e.g. https://www.scottaaronson.com/democritus/lec9.html

which has been discussed on HN https://news.ycombinator.com/item?id=19161028


Thanks! Looks really interesting


For anyone interested in seeing this in the context of statistical decision theory, see the final page of this lecture notes pdf:

http://www.stat.cmu.edu/~siva/705/lec16.pdf

In particular, for the parametric estimation setting, it can be shown that the Bayes Estimator under L_0 loss corresponds to simply finding the posterior distribution of a parameter given data, and then finding the mode of this distribution. Similarly, for L_1 loss, all we need do is find the median of the posterior distribution. And under L_2 loss, it’s just the expectation of the posterior. CMU’s 705 course is a great intro to statistical decision theory and stats more broadly for anyone interested!

(Disclaimer: I am a CMU PhD student in the machine learning department so I am somewhat biased to thinking these notes are good having taken this course myself)
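
To make that correspondence concrete, here's a small sketch with a hypothetical Beta(3, 7) posterior (the posterior, its parameters and the sample size are made up purely for illustration; this example is not from the notes): the L_2, L_1 and L_0 Bayes estimates are the posterior mean, median and mode.

    import numpy as np
    from scipy.stats import beta

    a, b = 3, 7                                    # hypothetical Beta(3, 7) posterior
    samples = beta.rvs(a, b, size=200000, random_state=0)

    print(np.mean(samples))                        # L_2 loss -> posterior mean,   ~0.30
    print(np.median(samples))                      # L_1 loss -> posterior median, ~0.29
    print((a - 1) / (a + b - 2))                   # L_0 loss -> posterior mode (MAP), 0.25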


And minimizing E_∞ means taking the midpoint between the max and min values!
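
A quick check of that (the toy data and the scipy call are mine): minimizing the worst-case discrepancy max_i |x_i - s| lands on the midrange.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.array([1.0, 2.0, 6.0, 9.0])

    worst = lambda s: np.max(np.abs(x - s))        # E_inf-style discrepancy
    res = minimize_scalar(worst, bounds=(x.min(), x.max()), method="bounded")

    print(res.x)                                   # ~5.0
    print((x.min() + x.max()) / 2)                 # midrange = 5.0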


I like the elevator bank analogy. You are waiting for an elevator. There are four equidistant elevator doors, all arranged in one line. Elevator #3 is out of order. Where do you stand?

Turns out the answer depends on what you want to optimize - maximum walking distance, average walking distance, or average square walking distance.


Does this mean we can calculate when to use mean, mode, or median and don't need to have an intuition for it?

Calculate every value, check which has the lowest discrepancy, and be done.


Part of the point is that any of those statistics is the correct one if it models the discrepancy you want.

You pick the one that treats discrepancy the way that's appropriate.

E.g., using the mode is like saying "if you're not the right answer, you're all equally bad".

Using the median is like saying "if you're not the right answer, you're as wrong as however far off you are".

Using the mean is like saying "if you're not the right answer, you're more and more wrong the further away you are, and the amount you're penalized itself gets more and more, so you're really incentivized to be closer"

Which of these is appropriate depends on the real thing you care about in whatever your numbers actually signify.


Could you elaborate the difference between mean and median a bit more?


This is true, but isn’t “it”.

If ten boxes fall at random from an Amazon Prime truck and you estimate a mean value of the parcels per m^3, you can extrapolate the total value of the loot in the truck.


There's further discussion of these ideas here: https://ram.rachum.com/median/




