An idea from physics helps AI see in higher dimensions (quantamagazine.org)
230 points by theafh on Jan 9, 2020 | 75 comments



Pretty amazing work -- a couple of thoughts:

1) The article doesn't say this, but dimensions don't always have to do with locations in space and time; you can treat any value that can continuously vary as a dimension -- for example, a person might have dimensions for personality type, age, hair color, etc. It seems like they could use this technique to train CNNs to recognize patterns in a lot of data besides imagery -- fraud detection based on credit card transactions, for example.

2) There are a lot of local and global symmetries in physics -- I wonder what new capabilities adding them to a CNN would enable?


The article is talking specifically about performing convolutions on higher-dimensional manifolds. This is different from the broader concept of data dimensionality typically associated with AI/ML.

Without repeating the article too much, this is important because it can be used to learn very complex systems from a series of lower-dimensional projections -- for example, creating a 3D map of a dog from a collection of 2D images of dogs. The resulting system can better detect a dog in a position it's never seen because the CNN has built into it the relationship between 3D space and the 2D representation of that space.


This is a serious question because I don’t know: what’s the difference between high dimensional data and high dimensional manifolds?


This isn't a complete answer, but consider a series of data points describing a line. E.g., (0,0,0,0), (1,1,1,1), (2,2,2,2), .... It's possible to have lines in any n-dimensional space, but the data is "high dimensional" -- you're using a large number of variables to describe a fairly simple process.

Contrast that example with any sufficiently large, uniform sampling of the n-dimensional unit cube. The data is inherently high dimensional, and any attempt (at least from a certain class of allowed methods -- no need to burden ourselves with the details) to reduce from n to k<n dimensions will throw away some important information about the structure of the data.

Interestingly, it's very possible to have a low dimensional representation of a high dimensional process. One of the other comments mentioned the example of a photograph representing a 3D scene (also containing a weak, implicit view into some other variables like temperature). That transformation from 3D->2D is inherently lossy, but for some kinds of problems a finite sampling of a low dimensional representation allows you to uniquely reconstruct a high dimensional data representation. I haven't read the article yet, but the other comments seem to indicate something of that flavor happening here.
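
A minimal numpy sketch of the distinction (my own toy example -- PCA standing in for the "class of allowed methods"):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    # "High dimensional data, low dimensional process": points on a line in R^4.
    t = rng.uniform(0, 10, size=(500, 1))
    line = t @ np.ones((1, 4))          # each point is (t, t, t, t)

    # Genuinely high dimensional data: uniform samples from the 4D unit cube.
    cube = rng.uniform(0, 1, size=(500, 4))

    for name, X in [("line", line), ("cube", cube)]:
        var = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
        print(f"{name}: 1 component explains {var:.0%} of the variance")
    # line: ~100% -- the process was 1D all along
    # cube: ~25% -- any reduction to 1D throws structure away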


Disclaimer: This is reaching the limits of my math abilities.

The difference comes from the fact that higher-dimensional tuples may contain data that is independent of the other fields. Say you have a tuple that's (name, dob, address) and a projection function that accepts such a 3-tuple and returns a 2-tuple of (name, dob). For that function, the address dimension has no relationship at all to the other fields, meaning that there is no unproject function a person could create such that 3_tuple == unproject(project(3_tuple)).

With manifolds, the higher dimensions can have a relationship with the lower-dimensional data, and such a relationship can be encoded into a function. What the researchers appear to have designed is a system that, given a priori knowledge of the task and enough n-tuples for learning, can produce an approximation of such an unproject function.

Thus, after learning, they have a system where unproject(project(n_tuple)) ~= n_tuple, because they were able to inform the learning system about the nature of the relationship between the dimensions.
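
A toy sketch of what I mean, in Python (hypothetical names, obviously not the researchers' code):

    # Projecting away an *independent* field is irreversible:
    def project(triple):
        name, dob, address = triple
        return (name, dob)              # the address is simply gone

    # Two different 3-tuples collapse to the same 2-tuple,
    # so no unproject function can exist:
    assert project(("ada", 1815, "london")) == project(("ada", 1815, "paris"))

    # But if the dropped coordinate is a *function* of the others -- say a
    # point constrained to the surface z = x^2 + y^2 -- projection is invertible:
    def project_point(p):
        x, y, z = p
        return (x, y)

    def unproject_point(q):
        x, y = q
        return (x, y, x**2 + y**2)      # re-derive z from the known relationship

    p = (1.0, 2.0, 5.0)
    assert unproject_point(project_point(p)) == p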


s/tuple/coordinate


> 2) There are a lot of local and global symmetries in physics -- I wonder what new capabilities adding them to a CNN would enable?

It would be trivial to make any ML model satisfy dimensional homogeneity. Just use only dimensionless variables consistent with the Buckingham Pi theorem. Other symmetries would probably have to be baked in from the start.

As I recall, some engineers are developing ML-like models that satisfy all sorts of physical constraints under the name "model order reduction".
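
For the Buckingham Pi point, a minimal sketch (a pendulum example of my own choosing): T, L, g, and m collapse into the single dimensionless group T*sqrt(g/L), and a model fed only that group can't even see a change of units.

    import numpy as np

    # Pendulum: period T, length L, gravity g (mass drops out entirely).
    # Buckingham Pi: the physics lives in pi = T * sqrt(g / L).
    def to_dimensionless(T, L, g):
        return T * np.sqrt(g / L)

    T, L, g = 2.0, 1.0, 9.81                       # seconds, metres
    T_ms, L_cm = T * 1000, L * 100                 # the same system in ms and cm
    g_cm_ms2 = 9.81 * 100 / 1000**2                # gravity in cm/ms^2

    print(to_dimensionless(T, L, g))               # ~6.264
    print(to_dimensionless(T_ms, L_cm, g_cm_ms2))  # identical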


Excerpt:

"Now, researchers have delivered, with a new theoretical framework for building neural networks that can learn patterns on any kind of geometric surface. These “gauge-equivariant convolutional neural networks,” or gauge CNNs, developed at the University of Amsterdam and Qualcomm AI Research by Taco Cohen, Maurice Weiler, Berkay Kicanaoglu and Max Welling, can detect patterns not only in 2D arrays of pixels, but also on spheres and asymmetrically curved objects. “This framework is a fairly definitive answer to this problem of deep learning on curved surfaces,” Welling said."


Can someone explain to me why advances in actual model performance come from using analogies from physics when there are papers that supposedly provide a mathematical explanation of convolution?

"A Mathematical Theory of Deep ConvolutionalNeural Networks for Feature Extraction":

https://arxiv.org/pdf/1512.06293.pdf

"Understanding Convolutional Neural Networks with A Mathematical Model":

https://arxiv.org/pdf/1609.04112.pdf


Because it's not a standard convolutional net by the description. The difference is:

A) Studying an existing technique with math.

B) Coming up with a new technique.

You could get a modern engineering consultancy to review your steam engine, but it would still be a steam engine.


Is multidimensional AI the new 2020 buzzword?


I sure hope so. Telling people I build multidimensional data structures for a living has only yielded glazed-over eyes thus far.


- I'm a developer.

- Huh ?

- I build websites. (types on an air keyboard)


"Developer" is a word that requires some context. A stranger might not know if you worked in construction or on computers.


In a professional context I'll say I'm a software developer. If I want to brag a bit I'll say I'm a software engineer. If I'm with friends I'll say I'm a programmer. If I'm with family or older people I'll dip my toes with "I work with computers" and maybe further explain if prompted.


>and maybe further explain if prompted.

I write manuals for computers.

"So, like, for how to use them?"

No, the computer reads it so it knows what to do.


I just say Software Engineer; it's what my employer calls me, and it resolves what I want to call myself.


This is much more entertaining when you have family who are PEs and their eyes twitch every time.


Which is funny, because there's also a backlash against Computer Science claiming it is not in fact a science, which is why they added math courses to some degrees, at least here in Florida. I think Computer Programming is probably best described as just Computer Programming, but I do like saying Software Engineer every time someone asks, because it sounds good enough to me. Until they standardize our title into one single thing, I'll just go by SE.


Funny, where I come from the earliest computer science departments at universities were basically joint ventures of the maths and electrical engineering departments. Consequently they were quite maths-heavy, and it shows to this day.


Holy shit, are you me? I do the exact same thing. I literally say "I build websites" while typing on an air keyboard.


Sign language (BSL) uses air keyboard for "programmer": https://www.signbsl.com/sign/computer-programmer


Glazed over in awe or boredom?


Like those little question marks that would appear over AI that was "thinking" in 90's games.


This is super cool and I'm pretty sure this is basically topology. The article was pretty hard to read though. It reminds me a little bit of the https://en.m.wikipedia.org/wiki/Hairy_ball_theorem


Not really topology; it's more like group theory and representations.

Ordinary convnets are a way of building in translational symmetry, which is the group R^2 (in the plane). The work being described extends this to larger symmetry groups, such as rotations of a molecule in 3D (which is SO(3)).

For either of these, you can work in Fourier space instead of real space, where convolutions become products. For ordinary convnets this means the ordinary FFT, but nobody does that, as translating to neighbouring pixels is simple enough. Rotations aren't so simple, so working in Fourier space can be an efficient way to do things. And the connection to physics is really just that the representation theory of SO(3) is a bread-and-butter exercise there, the basis of atomic theory.
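
A one-dimensional numpy sketch of "convolutions become products in Fourier space" (the SO(3) version swaps the FFT for spherical harmonics, which won't fit in a comment):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(64)     # a 1D "signal"
    k = rng.standard_normal(64)     # a filter on the same grid

    # Pointwise product in Fourier space...
    conv = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

    # ...equals circular convolution in real space:
    direct = np.array([sum(x[j] * k[(i - j) % 64] for j in range(64))
                       for i in range(64)])
    assert np.allclose(conv, direct)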


What happens if there are multiple 3D or 4D objects? Do we then need an attention mechanism as well? Or is there some topology where a "where vs what" pathway emerges naturally?


Isn't a 3D object basically a 4D object's surface?


The only valid answer to your question is "probably"


Is this very different from a graph convolutional network (GCN)? Seems like a GCN would have a lot of the same equivariance properties (e.g. orientation, units of measure, etc.)?


_I_ would like to have a VR experience in higher dimensions. It should not be completely impossible to build some kind of actuators that I can somehow attach to my body to sense my orientation and acceleration in the fourth dimension.


May I recommend 4D Toys? It's made by the same guy who's developing Miegakure and it has a VR version. The controls are a bit limited though in that user-initiated rotations are limited to 3D.


I continue to be unceasingly disappointed by his failure to release Miegakure after all these years.


I thought that was me being stuck in 3D and the toys around me moving in 4D? Whereas I would like to move myself in 4D...


There's a slider to move yourself in the extra dimension.


Something that could go by the same title is the use of tensor networks for ML. I think it works like a pre-optimization step by dimension reduction of the solution space, but if someone could give the right intuitive explanation I'd be much obliged.

It seems to be a way to lessen the inductive bias that comes from choosing among the available ML algos. That is, it vastly increases the solution space but remains effective by omitting unlikely solutions.


This is a really interesting application of differential geometry in machine learning! And the allusion at the end to having the system eventually learn the symmetries of the system and make use of that is really interesting. All the examples they gave were very physical, like climate models, but I imagine you could find symmetries in much more abstract problems that may not be intuitive.


Related G-CNN video: https://youtu.be/wZWn7Hm8osA


Hype pipeline: 1) Take a banal piece of feature engineering work. 2) Add an Albert Einstein reference. 3) Profit.


One of the strangest things in AI, to me, is that you can average the weights of multiple different models to create a single model that’s better than any of the individuals.

This is how distributed training often works, for example. Data parallelism.

I don’t understand why it still works in higher dimensions, but it seems to.

The intuition is that the multiple models are “spinning” around the true solution, so averaging gives the final result more quickly. But it works even early in the training process.


When we use data parallelism, we're summing/averaging the gradients induced by different parts of the data, not the weights of the model itself. When using multiple models for ensemble methods, we're summing/averaging the output of the models for a sample. We're not summing/averaging the weights of the models themselves. While averaging the weights of several models might work on a given problem, it definitely doesn't work in general.

    w2 * tanh(w1 * x) = -w2 * tanh(-w1 * x)

But if you average the weights in those two equivalent models you get 0.

If you're talking about asynchronous data parallelism, then there can be some averaging of weights, but they all start with the same weights and are re-synched often enough that weights are never too different to break it.
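
Concretely, a toy numpy check of the identity above:

    import numpy as np

    x = np.linspace(-2, 2, 5)
    w1, w2 = 1.5, 0.7

    # Two "networks" computing the same function, because tanh is odd:
    f_a = w2 * np.tanh(w1 * x)
    f_b = -w2 * np.tanh(-w1 * x)
    assert np.allclose(f_a, f_b)

    # Averaging their weights gives (0, 0): the zero function.
    w1_avg, w2_avg = (w1 - w1) / 2, (w2 - w2) / 2
    print(w2_avg * np.tanh(w1_avg * x))   # all zeros, nothing like f_a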


https://www.docdroid.net/faDq8Bu/swarm-training-v01a.pdf

We average the weights themselves, and the efficiency seems to be similar to gradient gathering.

It’s also averaging in slices, not the full model. There’s never a full resync.

SWA is the theoretical basis for why it works, I think.

Another way of thinking about it: If the gradients can be averaged, then so can the weights.


If you are averaging weights often enough, then it's basically the same as averaging gradients. If you average the weights of a bunch of independently-trained models, you're going to have a rough time. Even if the function computes the exact same thing, the order of rows and columns in the intermediate matrices will totally ruin your averaging strategy.
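
A toy numpy illustration of the row/column-order problem (a hypothetical two-layer net of my own, not anyone's real model):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((3, 4))    # input -> hidden
    W2 = rng.standard_normal((4, 1))    # hidden -> output
    x = rng.standard_normal((1, 3))

    f = lambda A, B: np.tanh(x @ A) @ B

    # Permuting the hidden units leaves the function unchanged...
    perm = [2, 0, 3, 1]
    assert np.allclose(f(W1, W2), f(W1[:, perm], W2[perm, :]))

    # ...but averaging original and permuted weights computes something else:
    print(f(W1, W2), f((W1 + W1[:, perm]) / 2, (W2 + W2[perm, :]) / 2))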


Concentration of measure. If you have a quantity that depends on a large number of random variables but not too strongly on any small subset of them, it tends to behave like a constant. That's the intuition behind the law of large numbers, the central limit theorem, a bunch of concentration inequalities, and model averaging.
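
The law-of-large-numbers flavour of it, in a few lines of numpy:

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (10, 1_000, 100_000):
        means = rng.uniform(0, 1, size=(50, n)).mean(axis=1)
        print(n, means.std())   # spread of the sample mean shrinks like 1/sqrt(n)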


And, of course, there is nothing that says this would work for intermediate layers, since the same dimensions may get there from any input.

What works is averaging similar networks and averaging your networks a lot of times.


Are you familiar with a paper called Synergy of Monotonic Rules?

http://jmlr.csail.mit.edu/papers/volume17/16-137/16-137.pdf

The math there is above my paygrade, but it describes a way to structurally combine a certain class of ML models for very significant performance gains. More importantly, it describes why it works.


You are averaging weights in distributed training? That seems like it would be rife with pitfalls unless you average after every batch.

I always thought the preferred method was to average the gradient updates, and pass that to update the single mother-model.


I believe it works in distributed training because the models never have time to diverge far enough to be "incompatible".


Isn't this sort of like "the wisdom of crowds"?

https://en.wikipedia.org/wiki/The_Wisdom_of_Crowds


Even stranger is that such a misleading/false claim is upvoted here.


I tried averaging the weights from a bunch of differently-trained neural networks for a Loss.jpg detector once.

If you think about what sort of brain you'd get if you "averaged" a few hundred geniuses' brains by blending them into a soup and adding some gelatin to a brain-sized sample, yeah, that's about the level of intelligence I saw in the "average neural net." It was basically in a constant state of seizure.


> One of the strangest things in AI, to me, is that you can average the weights of multiple different models to create a single model that’s better than any of the individuals.

I think this is one of the least strange things in AI. All you're doing is taking N overfitted models (unlikely to be overfit in the same way) and then asserting that the average of those predictions is probably not overfitted as much (regularization). Overfitting as a concept is not restricted to some number of dimensions.


That would be averaging the predictions, not averaging the weights.

It doesn't really matter though, because the parent comment is nonsense and you can't just average the weights at the end and get a working neural net.


I just assumed he meant averaging the predictions but said it wrong.


The weights aren't averaged unless the training step is synchronous. Even then, most of the time it is the gradients that are added up rather than the actual weights.

For inference, I don't think there are many papers that claim a direct average of weights performs better than any single model. It is usually the output that is accumulated in some way.


https://arxiv.org/abs/1803.05407

"Averaging Weights Leads to Wider Optima and Better Generalization"

Weirdly, the averaging doesn't have to be synchronous.
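
A minimal sketch of the SWA idea (a scalar toy of my own, not the paper's setup): SGD rattles around an optimum, and a running average of periodic weight snapshots tends to land nearer the centre than the last iterate.

    import numpy as np

    rng = np.random.default_rng(0)
    w, w_swa, n = 0.0, 0.0, 0
    for step in range(1000):
        w += -0.1 * (w - 1.0) + rng.normal(0, 0.05)   # stand-in for an SGD step
        if step >= 100 and step % 10 == 0:            # snapshot after a burn-in
            n += 1
            w_swa += (w - w_swa) / n                  # running mean of snapshots

    print(w, w_swa)   # w_swa typically sits much closer to the optimum at 1.0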


As others commented, you either ensemble models (average the predictions) or average the update (gradient).

For ensembling, the mathematical justification for why this surprising result is true (e.g. just averaging many weak but different models gives a better model) is pretty interesting: https://en.wikipedia.org/wiki/Condorcet%27s_jury_theorem
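
You can see the jury theorem kick in with a few lines of numpy (hypothetical 60%-accurate voters):

    import numpy as np

    rng = np.random.default_rng(0)
    # 101 weak "models", each independently correct 60% of the time:
    votes = rng.random((10_000, 101)) < 0.6
    print((votes.sum(axis=1) > 50).mean())   # ~0.98: the majority beats any voter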


Works for humans as well (wisdom of crowds).


The claim as stated is not true; it would be equivalent to taking the average of people's brains and expecting that to be as good as the average of their answers. Nonsense.


This article is all over the place, with almost no substance. It starts by talking about Einstein's theory of relativity, it says this definitively solves curved data, and it gives almost no insight into what is actually different. It is so bad it makes me want to avoid this site altogether.


The article includes multiple links to the related underlying research papers, for people who need more substance.


My point is that the article is worse than just having a link to the paper. It just says the work might lead to cures for diseases, better self-driving cars, and lots of other abstract sensational nonsense. It could be used as a template: any new machine learning paper could just be linked to it.


If you think gauge invariance on Riemannian manifolds is an empty topic with no substance, you might not want to be working in machine learning.


Did you read my comment before writing this?


I thought quantamagazine was above publishing click-bait, but I guess not.

Neural networks already see in "higher dimensions" (whatever that means). Anyone who's ever used neural networks knows that each neuron's branch (i.e. dendrite) of an N-sized vector can already be thought of as a "dimension" of a data set. CNNs (convolutions) flatten that data (reduce it, or see the same pattern over fewer "dendrites", much like PCA, etc.).

CNNs only make sense when working with image data anyways.


> CNNs only make sense when working with image data anyways

Not true, N-dimensional convnets, 1-d convnets (for NLP and time series analysis), spatially sparse convnets, graph and non-Euclidean space convnets, ... exist and are used.

CNNs are akin to multiscale wavelet transforms. They can be applied on different spaces (just as graph wavelet transforms exist).
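
For instance, a minimal 1-D convnet for a univariate time series might look like this in PyTorch (my sketch, with arbitrary layer sizes):

    import torch
    import torch.nn as nn

    # Same translation symmetry as an image convnet, just along one axis.
    model = nn.Sequential(
        nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5, padding=2),
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),   # pool over time -> shift-invariant summary
        nn.Flatten(),
        nn.Linear(8, 2),           # e.g. binary classification of the series
    )

    x = torch.randn(4, 1, 100)     # batch of 4 series, 100 time steps each
    print(model(x).shape)          # torch.Size([4, 2])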


> CNNs only make sense when working with image data anyways.

Not true, CNNs are used for audio and text as well.

I don't think the title is clickbait; you may be misinterpreting it. It's referring to using CNNs on higher-dimensional inputs, not to the layers having multiple dimensions (which has been done since the creation of convnets).


Wow this is actually fairly close to my prediction for the HN Next Decade Prediction post.

Edit: I should say it is a big step in the direction of my prediction.


What prediction, what post?


Post: Ask HN: A New Decade. Any Predictions?

https://news.ycombinator.com/item?id=21941278


yeah, which comment?


https://news.ycombinator.com/reply?id=21945829&goto=threads%...

>My big prediction is in physics: Thanks to Einstein we live in a 4D reality, 3 spatial dimensions and the dimension of time.

>In the next 10 years I predict our understanding of physics will evolve (with a confirmation through observation) from us being 3D beings living in a 4D reality to...something more.


I think OP realized that linking the comment will reveal their real HN account name lol.


Nope, I just didn't realize linking to the post wasn't enough to find my comment.

Figured once there it would be pretty trivial to search for my name.

Anyway, linked above you, so now you have something else to lol about.


With "throwaway" as the prefix to your account name, it should be fair to assume that's not your main account... Also, I don't really see how your prediction has anything to do with gauge theory applied to CNN, unless I'm missing something much deeper here?



