Of course. As others here point out, the hypervolume of an n-dimensional hypersphere grows as the nth power of its radius. In high dimensions, a tiny increase in radius causes the hypervolume to grow by more than 100%. The concentration of hypervolume is always highest near the edge.[0]
[0] See kgwgk's comment below.
More precisely: it is the mass that is “concentrated” at the edge, not the density. In the Gaussian case the distribution “gets more and more dense in the middle” regardless of the number of dimensions. However, in high dimensions the volume in the middle is so low that essentially all the mass is close to the surface of the hypersphere.
What I found quite surprising in that context is that the volume of the n-dimensional ball of any fixed finite radius goes to zero as n goes to infinity (see, for example, the "high dimensions" section of https://en.m.wikipedia.org/wiki/Volume_of_an_n-ball).
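A quick numerical check of that fact (mine, not from the linked article), using the closed-form volume V_n(R) = pi^(n/2) R^n / Gamma(n/2 + 1):

```python
import numpy as np
from scipy.special import gammaln

def log_ball_volume(n, R=1.0):
    # log of V_n(R) = pi^(n/2) * R^n / Gamma(n/2 + 1), computed in log space
    # so that large n doesn't overflow
    return (n / 2) * np.log(np.pi) + n * np.log(R) - gammaln(n / 2 + 1)

for n in (1, 2, 5, 10, 20, 100, 1000):
    print(n, np.exp(log_ball_volume(n, R=2.0)))
# Even with radius R = 2 the volume grows at first, then collapses toward zero
# once Gamma(n/2 + 1) outruns (sqrt(pi) * R)^n.
```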
I was going to comment that what's going on here doesn't have much to do with the Gaussian distribution. In high dimensions, almost all of the volume of the unit ball is concentrated near the unit sphere. In the first comment, Frank Morgan makes the same remark, pointing out that you get the same effect with the uniform distribution on the unit cube in high dimensions.
Instead of plotting the cumulative mass distribution, why not just plot the mass distribution itself? I.e. the derivatives of the curves you've plotted?
In retrospect it's pretty trivial to see why this is so: it follows directly from the basic fact that the hypervolume of an n-dimensional body grows as the nth power of its linear size.
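To put a number on that (my own back-of-the-envelope, not from the article):

```python
# In n dimensions, volume scales as radius**n, so a tiny relative increase
# in radius multiplies the volume enormously.
import math

n = 1000
eps = 0.001                      # a 0.1% increase in radius
print((1 + eps) ** n)            # ≈ 2.72: the volume nearly triples
print(math.log(2) / n)           # ≈ 0.0007: a 0.07% increase already doubles it
```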
This is probably why things like the "average" person (or any other object with multiple characteristics) practically don't exist. Everybody is [close to] exceptional in at least some of their characteristics, or some combination of them :)
Isn't the unsuitability of the high-dimensional Gaussian intimately related to the fact that, for most realistic problem spaces, we actually believe there are far fewer relevant dimensions than the N >> 1 we measured?
A uniform Gaussian presupposes that the variates are either uncorrelated or all have the same pairwise linear interaction (a single fixed positive correlation).
If your actual problem has dimension 20, but you've measured it with N dimensions, then that means there are strong interactions between your measured variates, and moreover the intervariate interactions do not have a single fixed interaction strength (like a single Gaussian correlation), but probably vary like a random matrix.
This might be related to the Tracy-Widom[1] distribution somehow. Perhaps the distribution you use to replace the Gaussian should really be something like: first generate a random positive semi-definite matrix C, then generate random data based on different random choices of C.
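A minimal sketch of what that could look like (the shapes and the low-rank construction of C are my own illustrative choices, not anything from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
true_dim, measured_dim = 20, 100

# Random positive semi-definite covariance: a low-rank factor plus a little
# jitter so the Cholesky factorisation is well defined.
A = rng.standard_normal((measured_dim, true_dim))
C = A @ A.T + 1e-3 * np.eye(measured_dim)

# Draw correlated "measurements" with covariance C.
L = np.linalg.cholesky(C)
data = rng.standard_normal((5000, measured_dim)) @ L.T

# Unlike the flat spectrum an isotropic Gaussian assumes, the eigenvalues of
# the sample covariance are wildly uneven (roughly 20 large ones, the rest tiny).
eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))
print(eigvals[-3:], eigvals[:3])
```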
I won't dispute the main point of the article, but a couple of minor errors bug me. First, he kept referring to a Gaussian distribution as being the unit sphere, when of course the radius depends on the parameters of the Gaussian (the standard deviation). If it didn't, the statement wouldn't be invariant under a change of units. A bizarre mistake to repeat many times throughout the article.
Less importantly, the last paragraph says that the probability that two samples are orthogonal is "very high". Being precisely orthogonal is technically a probability-zero event. The author means "very close to orthogonal."
There was a good discussion about this problem in the context of Monte Carlo simulations in (1).
On that point, my teeth were grinding because the article assumes an identity covariance matrix, i.e. the bubble needn't even be spherical.
The second is that the squared norm has a chi-squared distribution. There's no point simulating it: you can just plot the pdf and read off all kinds of facts about its mean, variance, entropy, etc. (a quick sketch follows this comment). Also, IIRC Shannon had something to say about this.
However, I do think these facts are worth a reminder.
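On the chi-squared point, here's the kind of thing that's available in closed form (a quick sketch using scipy; n = 1000 is just an example):

```python
import numpy as np
from scipy.stats import chi2

n = 1000
sq_norm = chi2(df=n)     # distribution of ||x||^2 for x ~ N(0, I_n)

print(sq_norm.mean(), sq_norm.var(), sq_norm.entropy())   # mean = n, variance = 2n
# The pdf can be plotted directly instead of histogramming simulated samples:
x = np.linspace(n - 300, n + 300, 601)
pdf = sq_norm.pdf(x)
print(x[np.argmax(pdf)])                                  # mode = n - 2
```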
I don't (on the first point). Everyone with the background to understand the problem under discussion and appreciate the explanation already understands that Gaussians are parametrized. I challenge you to find a counterexample. The specifics of non-isotropic parametrizations are even less relevant to the discussion than scalar parametrization.
On the second point, I agree that the approximation deserves a mention.
Those image inputs [0] that were optimised to maximise a certain classification response were cool! Instead of going to this peak of the response function, is there a way to explore the shell where the actual images reside? Would such images look, to our eyes, more like real input than the optimised input? I suspect they won't, but I would still like to see what is between the dogs!
They use a gamma distribution which has more probability density near the origin, which causes samples around the origin and interpolations to be more like real input.
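For what it's worth, here is a sketch of how such a prior could be realised: a uniformly random direction with a gamma-distributed radius. This is only my guess at the construction, and the parameters are purely illustrative, not the actual recipe from the work being described:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def sample_latent(shape=2.0, scale=4.0):
    # direction uniform on the sphere, radius from a gamma distribution that
    # puts far more mass near the origin than the ~sqrt(dim) Gaussian shell
    direction = rng.standard_normal(dim)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape, scale) * direction

z0, z1 = sample_latent(), sample_latent()
mid = 0.5 * (z0 + z1)
print(np.linalg.norm(z0), np.linalg.norm(z1), np.linalg.norm(mid))
# Linear interpolants no longer fall into a low-probability hole at the
# centre, because the radius distribution itself covers that region.
```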
In information theory, there's a related concept of a "typical set": the set of sequences of samples from a distribution whose probability (or probability density) is very close to the expected probability. If you draw a sequence of samples, you are overwhelmingly likely to get a typical outcome as opposed to, say, anything resembling the most likely outcome.
As a concrete example, if you have a coin that gets heads 99% of the time and you flip it 1M times, you are overwhelmingly likely to get around 10k tails, even though the individual sequences with many fewer tails are each far likelier than the typical sequences.
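Putting rough numbers on that (my own arithmetic, working in log space to avoid underflow):

```python
import numpy as np
from scipy.stats import binom

N, p_tails = 1_000_000, 0.01

# The single most likely sequence is all heads ...
log_p_all_heads = N * np.log1p(-p_tails)
# ... and any one particular sequence with exactly 10,000 tails is vastly less likely.
log_p_one_typical = 10_000 * np.log(p_tails) + (N - 10_000) * np.log1p(-p_tails)
print(log_p_all_heads - log_p_one_typical)    # ≈ 46,000 nats in favour of all-heads

# Yet the count of tails you actually observe is overwhelmingly close to 10,000,
# because there are astronomically many such sequences:
print(binom(N, p_tails).interval(0.999))      # ≈ (9673, 10327)
```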
So I have a feeling that he's looking at the wrong histogram. If you plot the distribution of the vector magnitudes, you'll get a spike around some large number, and a very sharp falloff to the right and left.
However, it's not a "bubble" in the intuitive sense. He's looking at the magnitude distribution of the dots over the entire space, implicitly using the Cartesian coordinate system (discarding the angle and looking at just the magnitude).
If you look instead at the density of dots per unit volume (per unit of R^N hyper-volume, rather), then you'll still have the highest concentration at the center, with no "bubble".
Maybe it goes without saying, but what I found distinctive about the Gaussian distribution in multiple dimensions is that it seems to be the only distribution which produces a smooth radial pattern when its variates are plotted independently along each axis (rather than being constructed radially). All other distributions I tested exhibit a bias along the main axes when a number of (variateA, variateB) pairs are plotted. The Gaussian seems to be the only one which, fundamentally, shows no sign of the orientation of the axes it is plotted along.
Comes in handy for plotting a radially smooth "star cluster" without doing polar coordinates and trig. Just plot a load of (x=a_gauss, y=another_gauss, z=another_gauss) points and you have a radially smooth object. I don't think any other distribution can do that; it seems to me there is something mathematically profound about it, which I'm sure some mathemagicians have a proper grasp of.
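That rotational-symmetry claim is easy to check numerically. A minimal sketch (mine, not from the commenter's library): E[cos 4θ] of the point angles is zero for any rotationally symmetric cloud, and non-zero when the coordinate axes leave an imprint.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def axis_signature(x, y):
    theta = np.arctan2(y, x)
    return np.cos(4 * theta).mean()   # 0 for a rotationally symmetric cloud

print("gaussian:", axis_signature(rng.standard_normal(n), rng.standard_normal(n)))
print("uniform :", axis_signature(rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)))
print("laplace :", axis_signature(rng.laplace(size=n), rng.laplace(size=n)))
# Only the Gaussian comes out ≈ 0 (up to sampling noise); the other two show a
# clear angular bias, i.e. the orientation of the axes shows through.
```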
The 'co-linear' distortions of other distributions can be seen here in some plots in the test page for my random distribution lib:
I was recently reading the section on importance sampling in David MacKay's "Information Theory, Inference, and Learning Algorithms", pages 373-376 in the linked PDF (http://www.inference.org.uk/itprnn/book.pdf).
He shows that importance sampling will likely fail in high dimensions precisely because samples from a high dimensional Gaussian can be very different than those from a uniform distribution on the unit sphere.
Consider the ratio of the densities at the same point under a 1000D Gaussian and a 1000D uniform distribution over a sphere. If you sample enough times, the median ratio and the largest ratio will differ by a factor of around 10^19. Basically, most samples from the Gaussian will look much like samples from the uniform; a few will be wildly different.
Perhaps I'm misunderstanding both the post and MacKay's book. I'd be happy to be corrected.
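The weight blow-up is easy to reproduce. This isn't MacKay's exact construction, just a sketch of the same failure mode with two slightly mismatched 1000-D isotropic Gaussians; the 20%-wider proposal and the sample count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 1000, 100_000
sigma_q = 1.2                     # proposal Q = N(0, sigma_q^2 I), target P = N(0, I)

# Squared norms of proposal samples: sigma_q^2 times a chi-squared(dim) draw.
r2 = sigma_q**2 * rng.chisquare(dim, n_samples)
# log importance weight log(P(x)/Q(x)) depends on x only through r^2 = ||x||^2.
log_w = 0.5 * r2 * (1.0 / sigma_q**2 - 1.0) + dim * np.log(sigma_q)

decades = (log_w.max() - np.median(log_w)) / np.log(10)
print(f"largest weight / median weight ≈ 10^{decades:.0f}")
# Tens of orders of magnitude: any importance-sampling estimate is effectively
# decided by a handful of lucky samples.
```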
If you sample from a 1000D Gaussian, most of the points will be "close" to the hypersphere of radius sqrt(1000) ≈ 31.6. The distance to the center is between 31.6-2 and 31.6+2 for about 99.5% of the points, between 31.6-3 and 31.6+3 for 99.997%, between 31.6-4 and 31.6+4 for 99.999997%, and so on.
This is what he means when he says "practically indistinguishable from uniform distributions on the [unit] sphere." As tgb remarked in another comment, the "unit" bit is incorrect.
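Those coverage figures can be read off the chi distribution of the norm directly; a quick check:

```python
import numpy as np
from scipy.stats import chi

dim = 1000
norm_dist = chi(df=dim)        # distribution of ||x|| for x ~ N(0, I_1000)
center = np.sqrt(dim)          # ≈ 31.62

for k in (2, 3, 4):
    coverage = norm_dist.cdf(center + k) - norm_dist.cdf(center - k)
    print(f"P(| ||x|| - sqrt(1000) | < {k}) = {coverage:.9f}")
# Prints roughly 0.995, 0.99998 and 0.99999998 -- the same ballpark as the
# percentages quoted above.
```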
Having trouble understanding a lot of the specifics of this, though the broader concepts are grok-able. Before I go blindly googling around to get up to speed:
Any recommended foundational texts to begin with?
Recommended learning trajectory to get to where this is understandable?
The theoretical tools (and intuitions) we have today for making sense of the distribution of data, developed over the past three centuries, break down in high dimensions. The fact that in high dimensions Gaussian distributions are not "clouds" but actually "soap bubbles" is a perfect example of this breakdown. Can you imagine trying to model a cloud of high-dimensional points lying on or near a lower-dimensional manifold with soap bubbles?
If the data is not only high-dimensional but also non-linearly entangled, we don't yet have "mental tools" for reasoning about it:
* https://medium.com/intuitionmachine/why-probability-theory-s...
* https://news.ycombinator.com/item?id=15620794