Of course. As others here point out, the hypervolume of an n-dimensional hypersphere grows as the nth power of its radius. In high dimensions, a tiny increase in radius causes the hypervolume to grow by more than 100%. The concentration of hypervolume is always highest near the edge.[0]
[0] See kgwgk's comment below.
More precisely: it is the mass that is “concentrated” at the edge, not the density. In the Gaussian case the distribution “gets more and more dense in the middle” regardless of the number of dimensions. However, in high dimensions the volume in the middle is so low that essentially all the mass is close to the surface of the hypersphere.
What I found quite surprising in that context is that the volume of the n-dimensional ball of any fixed finite radius goes to zero as n goes to infinity (see, for example, the "high dimensions" section of https://en.m.wikipedia.org/wiki/Volume_of_an_n-ball).
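A quick numerical check of that fact (mine, not from the linked article), using the closed-form volume V_n(R) = pi^(n/2) R^n / Gamma(n/2 + 1):

```python
import numpy as np
from scipy.special import gammaln

def log_ball_volume(n, R=1.0):
    # log of V_n(R) = pi^(n/2) * R^n / Gamma(n/2 + 1), computed in log space
    # so that large n doesn't overflow
    return (n / 2) * np.log(np.pi) + n * np.log(R) - gammaln(n / 2 + 1)

for n in (1, 2, 5, 10, 20, 100, 1000):
    print(n, np.exp(log_ball_volume(n, R=2.0)))
# Even with radius R = 2 the volume grows at first, then collapses toward zero
# once Gamma(n/2 + 1) outruns (sqrt(pi) * R)^n.
```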
I was going to comment that what's going on here doesn't have much to do with the Gaussian distribution. In high dimensions, almost all of the volume of the unit ball is concentrated near the unit sphere. In the first comment, Frank Morgan makes the same remark, pointing out that you get the same effect with the uniform distribution on the unit cube in high dimensions.
Instead of plotting the cumulative mass distribution, why not just plot the mass distribution itself? I.e. the derivatives of the curves you've plotted?
In retrospect it's pretty trivial to see why this is so: it follows directly from the basic fact that the hypervolume of an n-dimensional body grows as the nth power of its linear size.
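To put a number on that (my own back-of-the-envelope, not from the article):

```python
# In n dimensions, volume scales as radius**n, so a tiny relative increase
# in radius multiplies the volume enormously.
import math

n = 1000
eps = 0.001                      # a 0.1% increase in radius
print((1 + eps) ** n)            # ≈ 2.72: the volume nearly triples
print(math.log(2) / n)           # ≈ 0.0007: a 0.07% increase already doubles it
```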
This is probably why things like the "average" person (or any other object with multiple characteristics) practically don't exist. Everybody is [close to] exceptional in at least some of their characteristics, or some combination of them :)
Isn't the unsuitability of the high-dimensional Gaussian intimately related to the fact that, for most realistic problem spaces, we actually believe there are far fewer relevant dimensions than the N >> 1 we measured?
A uniform Gaussian presupposes that the variates are either uncorrelated or all have the same pairwise linear interaction (a single fixed positive correlation).
If your actual problem has dimension 20, but you've measured it with N dimensions, then that means there are strong interactions between your measured variates, and moreover the intervariate interactions do not have a single fixed interaction strength (like a single Gaussian correlation), but probably vary like a random matrix.
This might be related to the Tracy-Widom[1] distribution somehow. Perhaps the distribution you use to replace the Gaussian should really be something like: first generate a random positive semi-definite matrix C, then generate random data based on different random choices of C.
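A minimal sketch of what that could look like (the shapes and the low-rank construction of C are my own illustrative choices, not anything from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
true_dim, measured_dim = 20, 100

# Random positive semi-definite covariance: a low-rank factor plus a little
# jitter so the Cholesky factorisation is well defined.
A = rng.standard_normal((measured_dim, true_dim))
C = A @ A.T + 1e-3 * np.eye(measured_dim)

# Draw correlated "measurements" with covariance C.
L = np.linalg.cholesky(C)
data = rng.standard_normal((5000, measured_dim)) @ L.T

# Unlike the flat spectrum an isotropic Gaussian assumes, the eigenvalues of
# the sample covariance are wildly uneven (roughly 20 large ones, the rest tiny).
eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))
print(eigvals[-3:], eigvals[:3])
```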
I won't dispute the main point of the article, but a couple of minor errors bug me. First, he kept referring to a Gaussian distribution as being the unit sphere, when of course the radius depends on the parameters of the Gaussian (the standard deviation). If it didn't, the statement wouldn't be invariant under a change of units. A bizarre mistake to repeat many times throughout the article.
Less importantly, the last paragraph says that the probability that two samples are orthogonal is "very high". Being precisely orthogonal is technically a probability-zero event. The author means "very close to orthogonal."
There was a good discussion about this problem in the context of Monte Carlo simulations in (1).
On that point, my teeth were grinding because the article assumes an identity covariance matrix, i.e. the bubble needn't even be spherical.
The second is that the squared norm has a chi-squared distribution. There's no point simulating it: you can just plot the pdf and read off all kinds of facts about its mean, variance, entropy, etc. (a quick sketch follows this comment). Also, IIRC Shannon had something to say about this.
However, I do think these facts are worth a reminder.
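On the chi-squared point, here's the kind of thing that's available in closed form (a quick sketch using scipy; n = 1000 is just an example):

```python
import numpy as np
from scipy.stats import chi2

n = 1000
sq_norm = chi2(df=n)     # distribution of ||x||^2 for x ~ N(0, I_n)

print(sq_norm.mean(), sq_norm.var(), sq_norm.entropy())   # mean = n, variance = 2n
# The pdf can be plotted directly instead of histogramming simulated samples:
x = np.linspace(n - 300, n + 300, 601)
pdf = sq_norm.pdf(x)
print(x[np.argmax(pdf)])                                  # mode = n - 2
```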
I don't (on the first point). Everyone with the background to understand the problem under discussion and appreciate the explanation already understands that Gaussians are parametrized. I challenge you to find a counterexample. The specifics of non-isotropic parametrizations are even less relevant to the discussion than scalar parametrization.
On the second point, I agree that the approximation deserves a mention.
Those image inputs [0] that were optimised to maximise a certain classification response were cool! Instead of going to this peak of the response function, is there a way to explore the shell where the actual images reside? Would such images look, to our eyes, more like real input than the optimised input? I suspect they won't, but I would still like to see what is between the dogs!
They use a gamma distribution which has more probability density near the origin, which causes samples around the origin and interpolations to be more like real input.
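For what it's worth, here is a sketch of how such a prior could be realised: a uniformly random direction with a gamma-distributed radius. This is only my guess at the construction, and the parameters are purely illustrative, not the actual recipe from the work being described:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def sample_latent(shape=2.0, scale=4.0):
    # direction uniform on the sphere, radius from a gamma distribution that
    # puts far more mass near the origin than the ~sqrt(dim) Gaussian shell
    direction = rng.standard_normal(dim)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape, scale) * direction

z0, z1 = sample_latent(), sample_latent()
mid = 0.5 * (z0 + z1)
print(np.linalg.norm(z0), np.linalg.norm(z1), np.linalg.norm(mid))
# Linear interpolants no longer fall into a low-probability hole at the
# centre, because the radius distribution itself covers that region.
```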
In information theory, there's a related concept of a "typical set": the set of sequences of samples from a distribution whose probability (or probability density) is very close to the expected probability. If you draw a sequence of samples, you are overwhelmingly likely to get a typical outcome as opposed to, say, anything resembling the most likely outcome.
As a concrete example, if you have a coin that gets heads 99% of the time and you flip it 1M times, you are overwhelmingly likely to get around 10k tails, even though the individual sequences with many fewer tails are each far likelier than the typical sequences.
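Putting rough numbers on that (my own arithmetic, working in log space to avoid underflow):

```python
import numpy as np
from scipy.stats import binom

N, p_tails = 1_000_000, 0.01

# The single most likely sequence is all heads ...
log_p_all_heads = N * np.log1p(-p_tails)
# ... and any one particular sequence with exactly 10,000 tails is vastly less likely.
log_p_one_typical = 10_000 * np.log(p_tails) + (N - 10_000) * np.log1p(-p_tails)
print(log_p_all_heads - log_p_one_typical)    # ≈ 46,000 nats in favour of all-heads

# Yet the count of tails you actually observe is overwhelmingly close to 10,000,
# because there are astronomically many such sequences:
print(binom(N, p_tails).interval(0.999))      # ≈ (9673, 10327)
```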
So I have a feeling that he's looking at the wrong histogram. If you plot the distribution of the vector magnitudes, you'll get a spike around some large number, and a very sharp falloff to the right and left.
However, it's not a "bubble" in the intuitive sense. He's looking at the magnitude distribution of the dots over the entire space, implicitly using the Cartesian coordinate system (discarding the angle and looking at just the magnitude).
If you look instead at the density of dots per unit volume (per unit of R^N hyper-volume, rather), then you'll still have the highest concentration at the center, with no "bubble".
Maybe it goes without saying, but what I found distinctive about the Gaussian distribution in multiple dimensions is that it seems to be the only distribution which produces a smooth radial pattern when its variates are plotted independently along each axis (rather than being constructed radially). All other distributions I tested exhibit a bias along the main axes when a number of (variateA, variateB) pairs are plotted. The Gaussian seems to be the only one which, fundamentally, shows no sign of the orientation of the axes it is plotted along.
Comes in handy for plotting a radially smooth "star cluster" without doing polar coordinates and trig. Just plot a load of (x=a_gauss, y=another_gauss, z=another_gauss) points and you have a radially smooth object. I don't think any other distribution can do that; it seems to me there is something mathematically profound about it, which I'm sure some mathemagicians have a proper grasp of.
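That rotational-symmetry claim is easy to check numerically. A minimal sketch (mine, not from the commenter's library): E[cos 4θ] of the point angles is zero for any rotationally symmetric cloud, and non-zero when the coordinate axes leave an imprint.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def axis_signature(x, y):
    theta = np.arctan2(y, x)
    return np.cos(4 * theta).mean()   # 0 for a rotationally symmetric cloud

print("gaussian:", axis_signature(rng.standard_normal(n), rng.standard_normal(n)))
print("uniform :", axis_signature(rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)))
print("laplace :", axis_signature(rng.laplace(size=n), rng.laplace(size=n)))
# Only the Gaussian comes out ≈ 0 (up to sampling noise); the other two show a
# clear angular bias, i.e. the orientation of the axes shows through.
```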
The 'co-linear' distortions of other distributions can be seen here in some plots in the test page for my random distribution lib:
I was recently reading the section on importance sampling in David MacKay's "Information Theory, Inference, and Learning Algorithms", pages 373-376 in the linked PDF (http://www.inference.org.uk/itprnn/book.pdf).
He shows that importance sampling will likely fail in high dimensions precisely because samples from a high dimensional Gaussian can be very different than those from a uniform distribution on the unit sphere.
Consider the ratio of the densities at the same point under a 1000D Gaussian and a 1000D uniform distribution over a sphere. If you sample enough times, the median ratio and the largest ratio will differ by a factor of around 10^19. Basically, most samples from the Gaussian will look much like samples from the uniform; a few will be wildly different.
Perhaps I'm misunderstanding both the post and MacKay's book. I'd be happy to be corrected.
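The weight blow-up is easy to reproduce. This isn't MacKay's exact construction, just a sketch of the same failure mode with two slightly mismatched 1000-D isotropic Gaussians; the 20%-wider proposal and the sample count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 1000, 100_000
sigma_q = 1.2                     # proposal Q = N(0, sigma_q^2 I), target P = N(0, I)

# Squared norms of proposal samples: sigma_q^2 times a chi-squared(dim) draw.
r2 = sigma_q**2 * rng.chisquare(dim, n_samples)
# log importance weight log(P(x)/Q(x)) depends on x only through r^2 = ||x||^2.
log_w = 0.5 * r2 * (1.0 / sigma_q**2 - 1.0) + dim * np.log(sigma_q)

decades = (log_w.max() - np.median(log_w)) / np.log(10)
print(f"largest weight / median weight ≈ 10^{decades:.0f}")
# Tens of orders of magnitude: any importance-sampling estimate is effectively
# decided by a handful of lucky samples.
```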
If you sample from a 1000D Gaussian, most of the points will be "close" to the hypersphere of radius sqrt(1000) ≈ 31.6. The distance to the center is between 31.6-2 and 31.6+2 for about 99.5% of the points, between 31.6-3 and 31.6+3 for 99.997%, between 31.6-4 and 31.6+4 for 99.999997%, and so on.
This is what he means when he says "practically indistinguishable from uniform distributions on the [unit] sphere." As tgb remarked in another comment, the "unit" bit is incorrect.
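Those coverage figures can be read off the chi distribution of the norm directly; a quick check:

```python
import numpy as np
from scipy.stats import chi

dim = 1000
norm_dist = chi(df=dim)        # distribution of ||x|| for x ~ N(0, I_1000)
center = np.sqrt(dim)          # ≈ 31.62

for k in (2, 3, 4):
    coverage = norm_dist.cdf(center + k) - norm_dist.cdf(center - k)
    print(f"P(| ||x|| - sqrt(1000) | < {k}) = {coverage:.9f}")
# Prints roughly 0.995, 0.99998 and 0.99999998 -- the same ballpark as the
# percentages quoted above.
```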
Having trouble understanding a lot of the specifics of this, though the broader concepts are grok-able. Before I go blindly googling around to get up to speed:
Any recommended foundational texts to begin with?
Recommended learning trajectory to get to where this is understandable?
The theoretical tools (and intuitions) we have today for making sense of the distribution of data, developed over the past three centuries, break down in high dimensions. The fact that in high dimensions Gaussian distributions are not "clouds" but actually "soap bubbles" is a perfect example of this breakdown. Can you imagine trying to model a cloud of high-dimensional points lying on or near a lower-dimensional manifold with soap bubbles?
If the data is not only high-dimensional but also non-linearly entangled, we don't yet have "mental tools" for reasoning about it:
* https://medium.com/intuitionmachine/why-probability-theory-s...
* https://news.ycombinator.com/item?id=15620794