> Statisticians are quick to reach for the Central Limit Theorem, but I think there’s a deeper, more intuitive, more powerful reason.
> The Normal Distribution is your best guess if you only know the mean and the variance of your data.
This is putting the cart before the horse, for sure. The reason why you only know the mean and the variance of your data is because you chose to summarize your data that way. And, the reason why you chose to summarize your data that way is in order to get the normal distribution as the maximum entropy distribution.
The normal distribution appears in a lot of places because it is the limiting case of many other distributions; that is the central limit theorem. It is very easy to work with because you can add or subtract a bunch of normal distributions and the result is just another normal distribution. You can add or subtract a bunch of other distributions and the result will often be closer to normal. You can also do a lot of work with the normal distribution using linear algebra techniques.
So, you choose to measure mean and variance in order to make the math easier. This does not always give the best outcome. For example, if you need more robust statistics, you might go for the median and the average (mean absolute) deviation rather than the mean and variance. When you then choose the maximum entropy distribution for those summaries, you end up with the Laplace distribution, which is much less convenient to work with mathematically than the normal distribution.
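To make the robustness point concrete, here is a minimal sketch (the numbers are purely illustrative) comparing how far a single outlier drags the mean and standard deviation versus the median and the average deviation from the median:

```python
import numpy as np

clean = np.array([9.8, 10.1, 10.0, 9.9, 10.2])
dirty = np.append(clean, 100.0)  # one corrupted measurement

for name, x in [("clean", clean), ("dirty", dirty)]:
    med = np.median(x)
    avg_dev = np.mean(np.abs(x - med))  # mean absolute deviation from the median
    print(f"{name}: mean={x.mean():.2f} std={x.std():.2f} "
          f"median={med:.2f} avg_dev={avg_dev:.2f}")

# The mean jumps from 10.0 to 25.0 and the std from 0.14 to ~33.5,
# while the median barely moves (10.0 -> 10.05) and the average deviation
# grows only linearly with the outlier rather than quadratically.
```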
You are spot-on. I would just add that there's another rather beautiful reason why, in practice, folks pick the mean and variance for their summary: they can be computed online over live data, in O(1) time and space! [0] If we extend this idea, we also get skew and kurtosis as the next two moments, again in constant time and space, and again a maximum entropy fit, now with skew and squish on top.
This is non-trivial; it means that we have an online algorithm which sends our measurements directly to our summaries, without worrying about how detailed our measurements are (how many samples are summarized). For comparison, taking the median/quantiles/percentiles/etc. requires either fixed-size buckets which lose precision (as seen in Prometheus histograms) or around O(log log n) space [1], which is still practical but pedantically not constant.
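For reference, here is a minimal sketch of what such an online update can look like (a Welford-style running mean and variance; I'm not assuming this is exactly what [0] describes):

```python
# Online (streaming) mean/variance estimator, Welford-style.
# Each update is O(1) time and space; no samples are retained.
class RunningMoments:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
rm = RunningMoments()
for x in stream:
    rm.update(x)
print(rm.mean, rm.variance())  # 5.0, 4.571...
```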
> This is putting the cart before the horse, for sure. The reason why you only know the mean and the variance of your data is because you chose to summarize your data that way.
No, it's not... A Gaussian is the best way to represent your knowledge of a value if you only know the mean and variance of its value.
So if you start with a stack of data and compress it down to a mean and variance, you've discarded most of your knowledge, and are left with a Gaussian as your best guess representation.
Yes, if you were to boil it down to different summary data, like a max and a min, you'd end up with a different state of knowledge and a different distribution.
But given a mean and variance, the Gaussian is your best choice of distribution, and not because of the central limit theorem, but because it has maximum entropy on those constraints. You don't always even have access to the source data in the first place, maybe just the summary statistics.
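A quick numeric check of that max-entropy claim (a sketch using scipy's differential entropy for a few textbook distributions, all scaled to the same variance; the Gaussian comes out on top):

```python
from scipy import stats

var = 1.0
dists = {
    "normal":  stats.norm(scale=var ** 0.5),
    "laplace": stats.laplace(scale=(var / 2) ** 0.5),            # var = 2 * scale^2
    "uniform": stats.uniform(loc=-(3 * var) ** 0.5,
                             scale=2 * (3 * var) ** 0.5),         # var = width^2 / 12
}
for name, d in dists.items():
    print(name, round(float(d.entropy()), 4))
# normal ~1.4189 nats > laplace ~1.3466 > uniform ~1.2425
```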
I would like to push back against this in favor of the original comment. The context of this remark within the article is:
> I was extremely confused as to why the Normal (Gaussian) Distribution pops up everywhere—in kurtotically-ignorant financial market analysis, in nature, everywhere. Thinking about it, the prevalence of the Gaussian is actually rather abnormal. Can you guess why it’s everywhere?
This is not a "compression of data" question. It's not an "uninformed distributional choice" question.
It's a "why is this distribution prevalent in Nature" question.
In this context, I think the CLT gives a better answer. There are a lot of averaging processes in Nature, and due to the CLT, averaging of independent perturbations must give rise to normal distributions.
It's possible to perhaps go a step deeper than the above. In some physical systems, you can look at the second moment as an energy -- like the voltage-squared in electrical systems.
In this case, due to the a priori finiteness of the system's energy, the Gaussian distribution can make a claim to being "inevitable" by the maxent argument in OP. ("In a system characterized by finite energy E, what is the least informative distributional constraint?")
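To make the averaging point concrete, here is a tiny simulation sketch: individually skewed (exponential) perturbations, once averaged, have skew and excess kurtosis close to the Gaussian's zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
raw = rng.exponential(size=(100_000, 100))  # individual perturbations: very skewed
avg = raw.mean(axis=1)                      # one average of 100 perturbations per row

print(stats.skew(raw.ravel()), stats.kurtosis(raw.ravel()))  # ~2.0, ~6.0
print(stats.skew(avg), stats.kurtosis(avg))                  # ~0.2, ~0.06 (Gaussian: 0, 0)
```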
The subject of my comment is, "Why are we given mean and variance?" If you take, "We are given mean and variance" as a presupposition, then you are having a different conversation.
The big problem with the maximum entropy argument is that if you apply some transformation to your data, you will end up with a different maximum entropy distribution. For example, you may choose to express your data as a rate (events / time) or as a period (time / event). Maximum entropy won't help you decide here; you need some kind of theoretical understanding of the underlying process that justifies your choice.
The same is true for normal distributions and mean / variance, but it's such an "obvious" choice that people forget to justify their models. My experience is that the premise of the CLT is much easier to justify, and you can use that to support your use of the normal distribution.
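A sketch of the rate/period point above, assuming you impose only a mean constraint on a positive quantity (which makes the exponential the max-ent distribution): re-expressing the very same distribution in the reciprocal coordinate no longer looks exponential at all.

```python
import numpy as np

LAM = 2.0  # illustrative rate parameter for the exponential max-ent density on rates

def rate_density(r):
    # max-entropy density on r > 0 subject to a fixed mean
    return LAM * np.exp(-LAM * r)

def period_density(t):
    # the *same* distribution expressed in t = 1/r, via change of variables
    return rate_density(1.0 / t) / t**2

ts = np.linspace(0.05, 10, 2000)
g = period_density(ts)
print(ts[np.argmax(g)])  # interior mode near LAM/2 = 1.0, so this is not exponential
```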
> if you only know the mean and variance of its value.
And why are they the only things you know? Most often, because they are the things you asked for previously... rather than, as OP hinted, e.g. some facts about quantiles.
Very good points. The Normal Distribution has very nice properties, which adds to its popularity, which, if I understand you correctly, adds to the significance we place on mean and variance.
I wouldn't say that the existence of the normal distribution is necessarily the reason we place significance on the mean and variance. The mean is incredibly natural to define once you move to measure-theoretic probability (it is simply the integral of a function, i.e. a random variable, with respect to some measure). From this point of view, random variables whose moments exist are simply functions in L^p. Further, when you move on to the CLT, those moments give you the properties of the characteristic function that make the proof work. These are all deep connections.
There's another interpretation of the mean (and conditional expectation) as quantities minimizing squared error. It's not surprising that squared error and variance are so similar and that these are connected.
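A quick numeric sketch of that interpretation (and, for contrast, of the median minimizing absolute error, which ties back to the Laplace/median comment above); the data here is made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
cs = np.linspace(0, 12, 100_001)  # candidate "guesses" c

sq_err = ((x[:, None] - cs) ** 2).sum(axis=0)   # total squared error for each c
abs_err = np.abs(x[:, None] - cs).sum(axis=0)   # total absolute error for each c

print(cs[sq_err.argmin()], x.mean())        # both ~3.6
print(cs[abs_err.argmin()], np.median(x))   # both ~2.0
```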
Great comment! You touched on something that got me wondering: do you know of something similar to MGF that determines a distribution but is not based on expected values?
Well, characteristic functions are more general than MGFs (in that they are always finite) and have very useful properties, but they have a similar definition and are also based on an expectation.
This may be pedantic, but an object that determines a distribution that isn't based on expected values is the CDF. In my first probability course in undergrad I think we defined probability mass functions and probability density functions first and defined CDFs in terms of them, but from a measure theoretic point of view, the CDF is more fundamental since it is defined for continuous and discrete distributions (and also for distributions that are neither).
I'd probably choose the probability measure itself, rather than the CDF, just because you won't always have a space with a nice order topology.
Of course the expectation value and the probability measure are really two sides of the same coin, the expectation value being equivalent to integration with respect to the probability measure.
One thing I'd add to this is that this kind of thinking makes your coordinate system really matter.
Consider measuring some cubes of uncertain size. You could describe them by their edge length or by their volume. Learning one tells you the other; they're equivalent data. However, a maximum entropy distribution on one isn't a maximum entropy distribution on the other.
Pragmatically, there's always something you can do (e.g. a Jeffreys prior), but philosophically, this has always made me uneasy with justifications about max entropy that don't also have justifications of the choice of coordinate system.
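A minimal sketch of the cube example: the same cubes, "uninformatively" uniform in edge length, are anything but uniform in volume.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.uniform(1.0, 2.0, size=200_000)  # uniform (max-ent on an interval) in edge length
v = s ** 3                               # the exact same cubes, described by volume

hist, edges = np.histogram(v, bins=7, range=(1.0, 8.0), density=True)
print(np.round(hist, 3))  # far from flat: the density piles up at small volumes
```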
Their joint distribution is degenerate, so it becomes a bit unclear how to even define entropy.
Personally, I'm of the view that the Kullback-Leibler divergence, which is defined for arbitrary probability measures (with no special treatment for continuous ones) and which is independent of the choice of coordinates, is the true measure of information.
Its downside is that you can only compare 2 distributions that way. For the discrete case you can just pick the uniform distribution as your non-informative base. The issues with the entropy definition for continuous distributions boil down to the problem of picking a uniform distribution for the real numbers.
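A small check of the discrete case mentioned above: Shannon entropy is just log2(N) minus the KL divergence to the uniform base distribution (the distribution here is arbitrary).

```python
import numpy as np

p = np.array([0.7, 0.1, 0.1, 0.1])
u = np.full_like(p, 1.0 / len(p))  # uniform base distribution

entropy = -np.sum(p * np.log2(p))            # ~1.357 bits
kl_to_uniform = np.sum(p * np.log2(p / u))   # ~0.643 bits
print(entropy, np.log2(len(p)) - kl_to_uniform)  # both ~1.357
```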
You're right, a proper extension of Shannon's entropy to the continuous case requires a reference measure. Jaynes discusses this in the Brandeis lectures I linked to in another comment: https://bayes.wustl.edu/etj/articles/brandeis.pdf
I see, of course in physics you have the advantage that the phase space already has a canonical volume form, which coincides with the normal uniform measure if you use canonical coordinates and, amazingly, is preserved by the equations of motion.
In most statistical problems you don't have such a nice measure, and it is always good to keep in mind that choosing a uniform 'improper prior', even implicitly, will mean your choice of coordinates will influence your result.
Edit: Hang on, I just noticed the names on the lecture notes you sent me. Uhlenbeck, Wheeler and Schwinger? That's one heck of a line-up. The part you linked seems to be by Uhlenbeck; I'm going to set aside some time to read that one more carefully.
The pdf I linked to contains only pages 181-218 from Vol III of the 1962 lectures (Jaynes). I’ve not read the others; I don’t know whether they’re available online.
Thought experiment: suppose your friend drives 80 miles to visit you. They tell you the trip took between 2 and 4 hours. You have no further information. How confident are you the trip took less than 3 hours?
Now they tell you they maintained a constant speed throughout the trip, a speed somewhere between 20 and 40mph. How confident are you your friend was driving faster than 30mph?
The principle of maximum entropy, applied to each formulation, gives you different answers. Putting 50% probability on speeds above 30mph implies a median trip time of 2hr40mins, not 3hrs. What gives? Which is the real way we should formulate travel times?
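Putting numbers on the two formulations (a quick sketch, nothing beyond the arithmetic already in the thought experiment):

```python
distance = 80.0  # miles

# Uniform ("max-ent") over travel time in [2, 4] hours:
p_lt_3h_uniform_time = (3.0 - 2.0) / (4.0 - 2.0)  # 0.5

# Uniform ("max-ent") over speed in [20, 40] mph:
# the trip takes < 3 hours exactly when speed > 80/3 ~ 26.7 mph
p_lt_3h_uniform_speed = (40.0 - distance / 3.0) / (40.0 - 20.0)

print(p_lt_3h_uniform_time, p_lt_3h_uniform_speed)  # 0.5 vs ~0.667
```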
This paradox is a good motivator for when Bayesian probability is useful. Your confidence is a posterior probability which is conditioned on some prior information. Initially you have little prior information beyond an interval of time and a distance. When you receive information about how the speed varied throughout the trip (here, that it was constant), this meaningfully updates your prior, and so the posterior changes.
The upshot here is that choosing the max entropy distribution as your prior isn't enough; you also need to choose some particular way to formulate the problem. Particular formulations (travel time vs. speed, here) imply different max entropy priors, even though the formulations are equivalent. Worse, there are infinitely many equivalent formulations, all with different implied max entropy priors.
You can get around this by choosing a non-max entropy prior, like [1], or by deciding on the One True Formulation for your problem. But (Bayesian) updating on the other formulations of the problem won't do it, because there isn't any information in the other formulations -- they're equivalent (by def).
Very cool! Thank you for sharing. We can get to these distributions a bunch of ways, and I find that every additional way of looking at something helps you understand it better. Now, I’m about to get nerdsniped by symmetry and invariances.
With this method, you can derive all of statistical mechanics from information theory, with constraints originating from thermodynamics. The observation of thermodynamic quantities, which are high-level observations on particles (i.e. related to means, etc., not to individual particles), provides constraints of the same kind as the ones listed in this article. This approach was pioneered by Jaynes (1957), "Information theory and statistical mechanics, I": https://www.semanticscholar.org/paper/Information-Theory-and...
> The Normal Distribution is your best guess if you only know the mean and the variance of your data.
That's awful advice for some domains. If your process dynamics are badly behaved statistically, with power laws and the like, the "mean" and "variance" you're calculating from samples are probably rubbish.
Choosing a starting distribution is really a statement about how you're exposing yourself to risk; there is no such thing as a "best guess".
I’m making no statement about what your priors are, just that if you have the mean and variance, the max entropy distribution is the Normal. If you also know the skew and kurtosis, you’ll pick something else.
"And if we weigh this by the probability of that particular event happening, we get info ∝ p ⋅ log2(1/p)"
I fail to see the motivation for this step, and I think that's preventing me from seeing the argument as "intuitive". Could somebody explain?
Two steps back (info ∝ 1/p), it still makes sense to me: the rarer the event is, the bigger the resulting number, so if the event happens, the more "surprised" we are and the more information is gained. However, what do we achieve by weighting the bit count of the information by the probability?
Ah, I think I got it. The point of the exercise is not to formulate the concept of "amount of surprise (∝ amount of information gained) IN CASE the event happens" but the "EXPECTED amount of entropy gain", for us to know before it happens.
That's why we need to take a middle ground between very common events that aren't surprising, and gain us hardly anything, and rare events that gain us a lot of information, but happen so rarely that they don't matter a whole lot.
The formula derived here manages to find the balance between these two extremes.
The weighting with probability turns it into an expected value.
Remember that if x is a random variable, then its expected value is
E[x] = Sum_{All x} x p(x)
The interpretation is that E[x] is the average value of x. In particular, if we observe a bunch of x's one after the other, call them x_i, then the sample mean
S_n = (1/n) Sum_i x_i
which for a given "n" is random, converges to the deterministic constant E[x] as "n" gets large.
In this sense, the above formula for E[x] is "inevitable".
In the above case, the thing being averaged is "x", but the same holds true for x^2 ("what is the average value of x^2 ?"), or, in general, f(x) for any function f(.).
In this case, we're using the rather unusual function
f(x) = log_2(1/p(x)),
but the same intuition holds. It's the average number of bits needed to encode x, for instance.
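A tiny sketch of that last point, with a made-up discrete distribution: averaging the per-sample "surprise" log2(1/p(x)) over many draws recovers Sum p * log2(1/p).

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])

entropy = np.sum(p * np.log2(1.0 / p))             # 1.75 bits
samples = rng.choice(len(p), size=100_000, p=p)
avg_surprise = np.mean(np.log2(1.0 / p[samples]))  # sample mean of the surprise
print(entropy, avg_surprise)                       # 1.75 vs ~1.75
```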
Thanks. I was stuck with the idea that the whole argument was trying to formulate just the "amount of information gained | event happens", when it actually formulated the expected entropy gain.
My only advice is to end with a list of maximum entropy distributions to showcase the many applications of this theory. I often refer to such tables when I have varying constraints and want the best choice for representing the spread of the data.
This approach can mislead people because, by design, it assumes that the support is infinite and that the variance is finite, which is why it ends up with a thin-tailed distribution in the first place.
Plus, as klodolph said, arbitrarily restricting your knowledge to the mean and the variance as summary statistics is what leads to the Gaussian distribution. Moreover, in practice, arbitrarily restricting your knowledge is a violation of probability as a model of intuition, as shown by Jaynes.