> Statisticians are quick to reach for the Central Limit Theorem, but I think there’s a deeper, more intuitive, more powerful reason.
> The Normal Distribution is your best guess if you only know the mean and the variance of your data.
This is putting the cart before the horse, for sure. The reason why you only know the mean and the variance of your data is because you chose to summarize your data that way. And, the reason why you chose to summarize your data that way is in order to get the normal distribution as the maximum entropy distribution.
The normal distribution appears in a lot of places because it is the limiting case of many other distributions; that is the central limit theorem. It is very easy to work with because you can add or subtract a bunch of normal distributions and the result is just another normal distribution. You can add or subtract a bunch of other distributions and the result will often be closer to normal. You can also do a lot of work with the normal distribution using linear algebra techniques.
So, you choose to measure mean and variance in order to make the math easier. This does not always give the best outcome. For example, if you need more robust statistics, you might go for the median and the average (mean absolute) deviation rather than the mean and variance. When you then choose the maximum entropy distribution for those summaries, you end up with the Laplace distribution, which is much less convenient to work with mathematically than the normal distribution.
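To make the robustness point concrete, here is a minimal sketch (the numbers are purely illustrative) comparing how far a single outlier drags the mean and standard deviation versus the median and the average deviation from the median:

```python
import numpy as np

clean = np.array([9.8, 10.1, 10.0, 9.9, 10.2])
dirty = np.append(clean, 100.0)  # one corrupted measurement

for name, x in [("clean", clean), ("dirty", dirty)]:
    med = np.median(x)
    avg_dev = np.mean(np.abs(x - med))  # mean absolute deviation from the median
    print(f"{name}: mean={x.mean():.2f} std={x.std():.2f} "
          f"median={med:.2f} avg_dev={avg_dev:.2f}")

# The mean jumps from 10.0 to 25.0 and the std from 0.14 to ~33.5,
# while the median barely moves (10.0 -> 10.05) and the average deviation
# grows only linearly with the outlier rather than quadratically.
```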
You are spot-on. I would just add that there's another rather beautiful reason why, in practice, folks pick the mean and variance for their summary: they can be computed online over live data, in O(1) time and space! [0] If we extend this idea, we also get skew and kurtosis as the next two moments, again in constant time and space, and again a maximum entropy fit, now with skew and squish on top.
This is non-trivial; it means that we have an online algorithm which sends our measurements directly to our summaries, without worrying about how detailed our measurements are (how many samples are summarized). For comparison, taking the median/quantiles/percentiles/etc. requires either fixed-size buckets which lose precision (as seen in Prometheus histograms) or around O(log log n) space [1], which is still practical but pedantically not constant.
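For reference, here is a minimal sketch of what such an online update can look like (a Welford-style running mean and variance; I'm not assuming this is exactly what [0] describes):

```python
# Online (streaming) mean/variance estimator, Welford-style.
# Each update is O(1) time and space; no samples are retained.
class RunningMoments:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
rm = RunningMoments()
for x in stream:
    rm.update(x)
print(rm.mean, rm.variance())  # 5.0, 4.571...
```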
> This is putting the cart before the horse, for sure. The reason why you only know the mean and the variance of your data is because you chose to summarize your data that way.
No, it's not... A Gaussian is the best way to represent your knowledge of a value if you only know the mean and variance of its value.
So if you start with a stack of data and compress it down to a mean and variance, you've discarded most of your knowledge, and are left with a Gaussian as your best guess representation.
Yes, if you were to boil it down to different summary data, like a max and a min, you'd end up with a different state of knowledge and a different distribution.
But given a mean and variance, the Gaussian is your best choice of distribution, and not because of the central limit theorem, but because it has maximum entropy on those constraints. You don't always even have access to the source data in the first place, maybe just the summary statistics.
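A quick numeric check of that max-entropy claim (a sketch using scipy's differential entropy for a few textbook distributions, all scaled to the same variance; the Gaussian comes out on top):

```python
from scipy import stats

var = 1.0
dists = {
    "normal":  stats.norm(scale=var ** 0.5),
    "laplace": stats.laplace(scale=(var / 2) ** 0.5),            # var = 2 * scale^2
    "uniform": stats.uniform(loc=-(3 * var) ** 0.5,
                             scale=2 * (3 * var) ** 0.5),         # var = width^2 / 12
}
for name, d in dists.items():
    print(name, round(float(d.entropy()), 4))
# normal ~1.4189 nats > laplace ~1.3466 > uniform ~1.2425
```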
I would like to push back against this in favor of the original comment. The context of this remark within the article is:
> I was extremely confused as to why the Normal (Gaussian) Distribution pops up everywhere—in kurtotically-ignorant financial market analysis, in nature, everywhere. Thinking about it, the prevalence of the Gaussian is actually rather abnormal. Can you guess why it’s everywhere?
This is not a "compression of data" question. It's not an "uninformed distributional choice" question.
It's a "why is this distribution prevalent in Nature" question.
In this context, I think the CLT gives a better answer. There are a lot of averaging processes in Nature, and due to the CLT, averaging of independent perturbations must give rise to normal distributions.
It's possible to perhaps go a step deeper than the above. In some physical systems, you can look at the second moment as an energy -- like the voltage-squared in electrical systems.
In this case, due to the a priori finiteness of the system's energy, the Gaussian distribution can make a claim to being "inevitable" by the maxent argument in OP. ("In a system characterized by finite energy E, what is the least informative distributional constraint?")
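To make the averaging point concrete, here is a tiny simulation sketch: individually skewed (exponential) perturbations, once averaged, have skew and excess kurtosis close to the Gaussian's zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
raw = rng.exponential(size=(100_000, 100))  # individual perturbations: very skewed
avg = raw.mean(axis=1)                      # one average of 100 perturbations per row

print(stats.skew(raw.ravel()), stats.kurtosis(raw.ravel()))  # ~2.0, ~6.0
print(stats.skew(avg), stats.kurtosis(avg))                  # ~0.2, ~0.06 (Gaussian: 0, 0)
```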
The subject of my comment is, "Why are we given mean and variance?" If you take, "We are given mean and variance" as a presupposition, then you are having a different conversation.
The big problem with the maximum entropy argument is that if you apply some transformation to your data, you will end up with a different maximum entropy distribution. For example, you may choose to express your data as a rate (events / time) or as a period (time / event). Maximum entropy won't help you decide here; you need some kind of theoretical understanding of the underlying process that justifies your choice.
The same is true for normal distributions and mean / variance, but it's such an "obvious" choice that people forget to justify their models. My experience is that the premise of the CLT is much easier to justify, and you can use that to support your use of the normal distribution.
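A sketch of the rate/period point above, assuming you impose only a mean constraint on a positive quantity (which makes the exponential the max-ent distribution): re-expressing the very same distribution in the reciprocal coordinate no longer looks exponential at all.

```python
import numpy as np

LAM = 2.0  # illustrative rate parameter for the exponential max-ent density on rates

def rate_density(r):
    # max-entropy density on r > 0 subject to a fixed mean
    return LAM * np.exp(-LAM * r)

def period_density(t):
    # the *same* distribution expressed in t = 1/r, via change of variables
    return rate_density(1.0 / t) / t**2

ts = np.linspace(0.05, 10, 2000)
g = period_density(ts)
print(ts[np.argmax(g)])  # interior mode near LAM/2 = 1.0, so this is not exponential
```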
> if you only know the mean and variance of its value.
And why are they the only things you know? Most often, because they are the things you asked for previously... rather than, as OP hinted, e.g. some facts about quantiles.
Very good points. The Normal Distribution has very nice properties, which adds to its popularity, which, if I understand you correctly, adds to the significance we place on mean and variance.
I wouldn't say that the existence of the normal distribution is necessarily the reason we place significance on the mean and variance. The mean is incredibly natural to define once you move to measure-theoretic probability (it is simply the integral of a function, i.e. a random variable, with respect to some measure). From this point of view, random variables whose moments exist are simply functions in L^p. Further, when you move on to the CLT, those moments give you the properties of the characteristic function that make the proof work. These are all deep connections.
There's another interpretation of the mean (and conditional expectation) as quantities minimizing squared error. It's not surprising that squared error and variance are so similar and that these are connected.
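A quick numeric sketch of that interpretation (and, for contrast, of the median minimizing absolute error, which ties back to the Laplace/median comment above); the data here is made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
cs = np.linspace(0, 12, 100_001)  # candidate "guesses" c

sq_err = ((x[:, None] - cs) ** 2).sum(axis=0)   # total squared error for each c
abs_err = np.abs(x[:, None] - cs).sum(axis=0)   # total absolute error for each c

print(cs[sq_err.argmin()], x.mean())        # both ~3.6
print(cs[abs_err.argmin()], np.median(x))   # both ~2.0
```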
Great comment! You touched on something that got me wondering: do you know of something similar to MGF that determines a distribution but is not based on expected values?
Well, characteristic functions are more general than MGFs (in that they are always finite) and have very useful properties, but they have a similar definition and are also based on an expectation.
This may be pedantic, but an object that determines a distribution that isn't based on expected values is the CDF. In my first probability course in undergrad I think we defined probability mass functions and probability density functions first and defined CDFs in terms of them, but from a measure theoretic point of view, the CDF is more fundamental since it is defined for continuous and discrete distributions (and also for distributions that are neither).
I'd probably choose the probability measure itself, rather than the CDF, just because you won't always have a space with a nice order topology.
Of course the expectation value and the probability measure are really two sides of the same coin, the expectation value being equivalent to integration with respect to the probability measure.
One thing I'd add to this is that this kind of thinking makes your coordinate system really matter.
Consider measuring some cubes of uncertain size. You could describe them by their edge length or by their volume. Learning one tells you the other; they're equivalent data. However, a maximum entropy distribution on one isn't a maximum entropy distribution on the other.
Pragmatically, there's always something you can do (e.g. a Jeffreys prior), but philosophically, this has always made me uneasy with justifications about max entropy that don't also have justifications of the choice of coordinate system.
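A minimal sketch of the cube example: the same cubes, "uninformatively" uniform in edge length, are anything but uniform in volume.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.uniform(1.0, 2.0, size=200_000)  # uniform (max-ent on an interval) in edge length
v = s ** 3                               # the exact same cubes, described by volume

hist, edges = np.histogram(v, bins=7, range=(1.0, 8.0), density=True)
print(np.round(hist, 3))  # far from flat: the density piles up at small volumes
```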
Their joint distribution is degenerate, so it becomes a bit unclear how to even define entropy.
Personally, I'm of the view that the Kullback-Leibler divergence, which is defined for arbitrary probability measures (with no special treatment for continuous ones) and which is independent of the choice of coordinates, is the true measure of information.
Its downside is that you can only compare 2 distributions that way. For the discrete case you can just pick the uniform distribution as your non-informative base. The issues with the entropy definition for continuous distributions boil down to the problem of picking a uniform distribution for the real numbers.
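A small check of the discrete case mentioned above: Shannon entropy is just log2(N) minus the KL divergence to the uniform base distribution (the distribution here is arbitrary).

```python
import numpy as np

p = np.array([0.7, 0.1, 0.1, 0.1])
u = np.full_like(p, 1.0 / len(p))  # uniform base distribution

entropy = -np.sum(p * np.log2(p))            # ~1.357 bits
kl_to_uniform = np.sum(p * np.log2(p / u))   # ~0.643 bits
print(entropy, np.log2(len(p)) - kl_to_uniform)  # both ~1.357
```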
You're right, a proper extension of Shannon's entropy to the continuous case requires a reference measure. Jaynes discusses this in the Brandeis lectures I linked to in another comment: https://bayes.wustl.edu/etj/articles/brandeis.pdf
I see, of course in physics you have the advantage that the phase space already has a canonical volume form, which coincides with the normal uniform measure if you use canonical coordinates and, amazingly, is preserved by the equations of motion.
In most statistical problems you don't have such a nice measure, and it is always good to keep in mind that choosing a uniform 'improper prior', even implicitly, will mean your choice of coordinates will influence your result.
Edit: Hang on, I just noticed the names on the lecture notes you sent me. Uhlenbeck, Wheeler and Schwinger? That's one heck of a line-up. The part you linked seems to be by Uhlenbeck; I'm going to set aside some time to read that one more carefully.
The pdf I linked to contains only pages 181-218 from Vol III of the 1962 lectures (Jaynes). I’ve not read the others; I don’t know whether they’re available online.
Thought experiment: suppose your friend drives 80 miles to visit you. They tell you the trip took between 2 and 4 hours. You have no further information. How confident are you the trip took less than 3 hours?
Now they tell you they maintained a constant speed throughout the trip, a speed somewhere between 20 and 40mph. How confident are you your friend was driving faster than 30mph?
The principle of maximum entropy, applied to each formulation, gives you different answers. Putting 50% probability on speeds above 30mph implies a median trip time of 2hr40mins, not 3hrs. What gives? Which is the real way we should formulate travel times?
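Putting numbers on the two formulations (a quick sketch, nothing beyond the arithmetic already in the thought experiment):

```python
distance = 80.0  # miles

# Uniform ("max-ent") over travel time in [2, 4] hours:
p_lt_3h_uniform_time = (3.0 - 2.0) / (4.0 - 2.0)  # 0.5

# Uniform ("max-ent") over speed in [20, 40] mph:
# the trip takes < 3 hours exactly when speed > 80/3 ~ 26.7 mph
p_lt_3h_uniform_speed = (40.0 - distance / 3.0) / (40.0 - 20.0)

print(p_lt_3h_uniform_time, p_lt_3h_uniform_speed)  # 0.5 vs ~0.667
```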
This paradox is a good motivator for when Bayesian probability is useful. Your confidence is a posterior probability which is conditioned on some prior information. Initially you have little prior information beyond an interval of time and a distance. When you receive information about how the speed varied throughout the trip (here, that it was constant), this meaningfully updates your prior, and so the posterior changes.
The upshot here is that choosing the max entropy distribution as your prior isn't enough; you also need to choose some particular way to formulate the problem. Particular formulations (travel time vs. speed, here) imply different max entropy priors, even though the formulations are equivalent. Worse, there are infinitely many equivalent formulations, all with different implied max entropy priors.
You can get around this by choosing a non-max entropy prior, like [1], or by deciding on the One True Formulation for your problem. But (Bayesian) updating on the other formulations of the problem won't do it, because there isn't any information in the other formulations -- they're equivalent (by def).
Very cool! Thank you for sharing. We can get to these distributions a bunch of ways, and I find that every additional way of looking at something helps you understand it better. Now, I’m about to get nerdsniped by symmetry and invariances.
With this method, you can derive all of statistical mechanics from information theory, with constraints originating from thermodynamics. The observation of thermodynamic quantities, which are high-level observations on particles (i.e. related to means, etc., not to individual particles), provides constraints of the same kind as the ones listed in this article. This approach was pioneered by Jaynes (1957), "Information theory and statistical mechanics, I": https://www.semanticscholar.org/paper/Information-Theory-and...
> The Normal Distribution is your best guess if you only know the mean and the variance of your data.
That's awful advice for some domains. If your process dynamics are badly behaved statistically, with power laws and the like, the "mean" and "variance" you're calculating from samples are probably rubbish.
Choosing a starting distribution is really a statement about how you're exposing yourself to risk; there is no such thing as a "best guess".
I’m making no statement about what your priors are, just that if you have the mean and variance, the max entropy distribution is the Normal. If you also know the skew and kurtosis, you’ll pick something else.
"And if we weigh this by the probability of that particular event happening, we get info ∝ p ⋅ log2(1/p)"
I fail to see the motivation for this step, and I think that's preventing me from seeing the argument as "intuitive". Could somebody explain?
Two steps back (info ∝ 1/p), it still makes sense to me: the rarer the event is, the bigger the resulting number, so if the event happens, the more "surprised" we are and the more information is gained. However, what do we achieve by weighting the bit count of the information by the probability?
Ah, I think I got it. The point of the exercise is not to formulate the concept of "amount of surprise (∝ amount of information gained) IN CASE the event happens" but the "EXPECTED amount of entropy gain", for us to know before it happens.
That's why we need to take a middle ground between very common events that aren't surprising, and gain us hardly anything, and rare events that gain us a lot of information, but happen so rarely that they don't matter a whole lot.
The formula derived here manages to find the balance between these two extremes.
The weighting with probability turns it into an expected value.
Remember that if x is a random variable, then its expected value is
E[x] = Sum_{All x} x p(x)
The interpretation is that E[x] is the average value of x. In particular, if we observe a bunch of x's one after the other, call them x_i, then the sample mean
S_n = (1/n) Sum_i x_i
which for a given "n" is random, converges to the deterministic constant E[x] as "n" gets large.
In this sense, the above formula for E[x] is "inevitable".
In the above case, the thing being averaged is "x", but the same holds true for x^2 ("what is the average value of x^2 ?"), or, in general, f(x) for any function f(.).
In this case, we're using the rather unusual function
f(x) = log_2(1/p(x)),
but the same intuition holds. It's the average number of bits needed to encode x, for instance.
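A tiny sketch of that last point, with a made-up discrete distribution: averaging the per-sample "surprise" log2(1/p(x)) over many draws recovers Sum p * log2(1/p).

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])

entropy = np.sum(p * np.log2(1.0 / p))             # 1.75 bits
samples = rng.choice(len(p), size=100_000, p=p)
avg_surprise = np.mean(np.log2(1.0 / p[samples]))  # sample mean of the surprise
print(entropy, avg_surprise)                       # 1.75 vs ~1.75
```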
Thanks. I was stuck with the idea that the whole argument was trying to formulate just the "amount of information gained | event happens", when it actually formulated the expected entropy gain.
My only advice is to end with a list of maximum entropy distributions to showcase the many applications of this theory. I often refer to such tables when I have varying constraints and want the best choice for representing the spread of the data.
This approach can mislead people because, by design, it assumes that the support is infinite and that the variance is finite, which is why it ends up with a thin-tailed distribution in the first place.
Plus, as klodolph said, arbitrarily restricting your knowledge to the mean and the variance as summary statistics is what leads to the Gaussian distribution. Moreover, in practice, arbitrarily restricting your knowledge is a violation of probability as a model of intuition, as shown by Jaynes.