Neural network training makes beautiful fractals

alexmolas · on Feb 13, 2024

The results of the experiment seem counterintuitive just because the used learning rates are huge (up to 10 or even 100). These are not the lr you would use in a normal setting. If you look at the region of small lr it seems all of them converge.

So I would say the experiment is interesting, but not representative of real world deep learning.

In the experiment, you have a function of 272 variables with a lot of minima and maxima, and at each gradient descent step you take huge steps (due to big lr). So my intuition is that convergence is more a matter of luck rather than hyperparameters.

lopuhin · on Feb 13, 2024

If convergence were a matter of luck, it would look completely different, like white noise, but it clearly has well-defined structure.

The reason for high learning rate is that they used full batched training (see the first cell in https://colab.research.google.com/github/Sohl-Dickstein/frac...), and when batch sizes are large, learning rates typically can be large as well. Plus as others said it's more of a toy problem, it would be hard to get such detail on anything non-toy.

dmarchand90 · on Feb 13, 2024

I think the article is very honest about this just being a fun exploration. They even show how you can get similar patterns with newton's algorithm which is a more "classical" take

alexmolas · on Feb 13, 2024

Yes, the author is clear in this regard. But I've seen people interpreting this paper as "training deep networks is chaotic", and I don't think that's the case. I interpret it more as "if you're not careful with your learning rate your training will be chaotic"

repelsteeltje · on Feb 13, 2024

Yes. I immediately had visions of a misconfigured PID controller or similar chaotic emergence encountered in control theory. I was surprised and delighted though, that this type of chaos has so much fractal beauty.

nyrikki · on Feb 13, 2024

Note that not all fractals are chaotic.

Inter dimensionality is the defining feature of fractals.

A multilayer ANN will have compression that may be fractal like. But in theory a feed forward network could be single layer but will a lot more neurons.

I have the feeling this result was due to representation but there are some features that look like riddled basins.

Riddled basins do arise in SNN or spikey neutral networks that have continuous output bs the binary output of ANNs.

But as all PAC learning is just compression that may be the cause also.

I'll be digging into this on the weekend though.

breather · on Feb 13, 2024

With all do respect, why are you commenting here rather than replying to the person giving your comment context?

sdenton4 · on Feb 13, 2024

There was a nice paper about five years ago that also found that the best hyperparams were in the bounty of divergence, even for 'real' imagenet models.

locuscoeruleus · on Feb 13, 2024

What paper was that?

sdenton4 · on Feb 14, 2024

Here you go: https://arxiv.org/abs/1811.03600

vintermann · on Feb 13, 2024

Yes, but as he also says, the best hyperparameters are near the boundary. You want to be a bit greedy here.

alexmolas · on Feb 13, 2024

The way the author defines best hyperparameters is also a bit weird. To assign a "score" to each pair of HP the author sums all the losses during training (ie score = ∑ᵀₜ₌₀ lₜ) instead of just taking the value of the loss at the last epoch (score = l_T). This makes the converging solutions with high learning rates appear better since they have lower losses since the beginning (so the sum is smaller), but it doesn't mean that the final value of the loss is better than the ones with smaller learning rates.

telotortium · on Feb 13, 2024

Twitter: https://twitter.com/jaschasd/status/1756930242965606582 ArXiv: https://arxiv.org/abs/2402.06184

Abstract:

"Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations."

Contains several cool animations zooming in to show the fractal boundary between convergent and divergent training, just like the classic Mandelbrot and Julia set animations.

PheonixPharts · on Feb 13, 2024

I find this result absolutely fascinating, and is exactly the type of research into neural networks we should be expanding.

We've rapidly engineered our way to some very impressive models this past decade, and yet gap in our real understanding of what's going on has widened. There's a large list of very basic questions about LLMs that we haven't answered (or in some cases, really asked). This is not a failing of people researching in this area, it's only that things move so quickly there's not enough time to ponder things like this.

At the same time, the result, unless I'm really misunderstanding, gives me the impression that anything other than grid search hyper parameter optimization is a fools errand. This would give credence to the notion that hyper parameter tuning really is akin to just re-rolling a character sheet until you get one that is over powered.

sigmoid10 · on Feb 13, 2024

>exactly the type of research into neural networks we should be expanding.

While it certainly makes for some nice visualizations, the technical insight of this is pretty limited. First of all, this fractal structure emerges at learning rates that are far higher than those used in training actual neural networks nowadays. It's interesting that the training still converges for some combinations and that the (expected) hit-and-miss procedure yields a fractal structure. But if you look closely at the images, you'll see the best hyperparameters are, while close, not at the border. So even if you want to follow the meta-learning approach outlined in the post, your gradient descent has already screwed up before if it ever ends up in this fractal boundary region.

johan_felisaz · on Feb 13, 2024

There is an expanding field of study looking at machine learning with statistical physics tools. While there is still a lot of work to do in this area, it yields interesting insights on neural networks, e.g. linking their training with the evolution of spin glasses (a typical statistical physics problem). We can even talk about phase transition and universal exponents.

Most of the research is done with simpler models though (because mainly math people do it, and it's hard to prove anything on something as complex as a transformer).

krallistic · on Feb 13, 2024

> gives me the impression that anything other than grid search hyper parameter optimization is a fools errand. This would give credence to the notion that hyper parameter tuning really is akin to just re-rolling a character sheet until you get one that is over powered.

The visualizations only show that at the Border there are a lot of fractals, not in every part of the space. (Although the highest performance is often achieved close to the border.). I would not state hparam search as bad as that..

dylan604 · on Feb 13, 2024

what was the name of the Google app (Dreaming or some such) that would iterate frames similar to this that would find lizard/snakes/dogs/eyes and got super trippy the longer it ran? The demos would start with RGB noise, and within a few iterations, it was a full on psychedelic trip. It was the best visual of AI "hallucinating" I've seen yet

0xDEADFED5 · on Feb 13, 2024

https://en.wikipedia.org/wiki/DeepDream

dylan604 · on Feb 13, 2024

winner winner chicken dinner. thanks!

always thought we now know what the android's dreams were like

Sharlin · on Feb 13, 2024

DeepDream is essentially what started the current image-creating generative AI craze.

vintermann · on Feb 13, 2024

Yes. It all started with a leaked image on Reddit. Even though at that point it was just a rumour that it was generated by a neural net, it was so strikingly unlike anything else that it caused a huge stir. Wish I could find that image again, it was a fairly noisy "slugdog" thing.

Imnimo · on Feb 13, 2024

If you are a fan of the fractals but feel intimidated by neural networks, the networks used here are actually pretty simple and not so difficult to understand if you are familiar with matrix multiplication. To generate a dataset, he samples random vectors (say of size 8) as inputs, and for each vector a target output, which is a single number. The network consists of an 8x8 matrix and an 8x1 matrix, also randomly initialized.

To generate an output from an input vector, you just multiply by your 8x8 matrix (getting a new size 8 vector), apply the tanh function to each element (look up a plot of tanh - it just squeezes its inputs to be between -1 and 1), and then multiply by the 8x1 matrix, getting a single value as an output. The elements of the two matrices are the 'weights' of the neural network, and they are updated to push the output we got towards the target.

When we update our weights, we have to decide on a step size - do we make just a little tiny nudge in the right direction, or take a giant step? The plots are showing what happens if we choose different step sizes for the two matrices ("input layer learning rate" is how big of a step we take for the 8x8 matrix, and "output layer learning rate" for the 8x1 matrix).

If your steps are too big, you run into a problem. Imagine trying to find the bottom of a parabola by taking steps in the direction of downward slope - if you take a giant step, you'll pass right over the bottom and land on the opposite slope, maybe even higher than you started! This is the red region of the plots. If you take really really tiny steps, you'll be safe, but it'll take you a long time to reach the bottom. This is the dark blue section. Another way you can take a long time is to take big steps that jump from one slope to the other, but just barely small enough to end up a little lower each time (this is why there's a dark blue stripe near the boundary). The light green region is where you take goldilocks steps - big enough to find the bottom quickly, but small enough to not jump over it.

nighthawk454 · on Feb 13, 2024

Great description! Now we just need one on fractals for all us NN people haha

magicalhippo · on Feb 13, 2024

Here's the associated blog post, which includes the videos: https://sohl-dickstein.github.io/2024/02/12/fractal.html

Not a ML'er so not sure what to make of it, beyond a fascinating connection.

fancyfredbot · on Feb 13, 2024

This is really fun, and beautiful. Also, despite what people are saying about the learning rates being unrealistic, the findings also really fit well with my own experience in using optimisation algorithms in the real world. If our code ever had a significant results difference between processor architectures (e.g. a machine taking an avx code path vs an sse one) you could be sure that every time the difference began during execution of an optimisation algorithm. The chaotic sensitivity to initial conditions really showed up there, just as it did in the author's newton solver plot. Although I have knew at some level that this behaviour was chaotic it never would have occurred to me to ask if it made a pretty fractal!

why_only_15 · on Feb 13, 2024

I appreciate that his acknowledgements here were to his daughter ("for detailed feedback on the generated fractals") and wife ("for providing feedback on a draft of this post")

fallingfrog · on Feb 13, 2024

This is kind of random but- I wonder, if you had a sufficiently complex lens, or series of lenses, perhaps with specific areas darkened, could you make a lens that shone light through if presented with, say, a cat, but not with anything else? Bending light and darkening it selectively could probably reproduce a layer of a neural net. That would be cool. I suppose, you would need some substance that responded to light in a nonlinear way.

gwern · on Feb 13, 2024

You can do a lot. 'Digital sundials' come to mind: https://en.wikipedia.org/wiki/Digital_sundial https://gwern.net/doc/math/1991-stewart.pdf

But yes, you are restricted to linear things and you can't make a good photonic cat detector out of that easily. So all the photonic neural networks you may have heard of like https://arxiv.org/abs/2106.11747 wind up sticking some mechanical or electrical nonlinearity somewhere.

uoaei · on Feb 13, 2024

You can simulate materials, apply the wave equation, and get "layers" that compute outputs from given inputs, each modeled as points in space. It may be possible to manufacture such layers with metamaterials or something like that.

https://www.science.org/doi/10.1126/sciadv.aay6946

theWreckluse · on Feb 13, 2024

A research out of ETH Zurich, based on which the company Rayform https://rayform.ch/ was founded does exactly this! I was so excited when I saw the paper for the first time a couple of years ago.

mhh__ · on Feb 13, 2024

I've seen a student project that did just that. I don't have a link for you unfortunately.

mchinen · on Feb 13, 2024

This is really fun to see. I love toy experiments like this. I see that each plot is always using the same initialization of weights, which presumably makes it possible to have more smoothness between each pixel. I also would guess it's using the same random seed for training (shuffling data).

I'd be curious to know what the plots would look like with a different randomness/shuffling of each pixel's dataset. I'd guess for the high learning rates it would be too noisy, but you might see fractal behavior at more typical and practical learning rates. You could also do the same with the random initialization of each dataset. This would get at if the chaotic boundary also exists in more practical use cases.

arkano · on Feb 13, 2024

If you liked this, you may also enjoy: "Back Propagation is Sensitive to Initial Conditions" from the early 90's. The discussion section is fun.

https://proceedings.neurips.cc/paper/1990/file/1543843a4723e...

radarsat1 · on Feb 13, 2024

I'm really curious what effect the common tricks for training have on the smoothness of this landscape: momentum, skip connections, batch/layer/etc normalization, even model size.

I imagine the fractal or chaos is still there, but maybe "smoother" and easier for metalearning to deal with?

Wherecombinator · on Feb 13, 2024

This is pretty interesting. Can’t help but he reminded of all the times I’ve done acid. Having been deep in ‘fractal country’ a few times I’ve always felt the psychedelic effect is from my brain going haywire and messing up its pattern recognition. I wonder if it’s related to this.

KuzMenachem · on Feb 13, 2024

Reminds me of an excellent 3blue1brown video about Newton’s method [1]. You can see similar fractal patterns emerge there too.

[1] https://www.youtube.com/watch?v=-RdOwhmqP5s

int_19h · on Feb 13, 2024

I hope one day we'll have generative AI capable of producing stuff like this on demand:

https://www.youtube.com/watch?v=8cgp2WNNKmQ

passion__desire · on Feb 13, 2024

Not just that AI can overlay artistic styles like below on top of such fractal structures.

http://sub.blue/inkwell

int_19h · on Feb 13, 2024

Generating a single still image is not an issue, obviously. It's more about picking "good" fractal parameters, a "good" spot to zoom in on said fractal, and "good" colors to dress it all up. Even more so if you are trying to time it all to specific music matching the visuals, as some of the fractal art videos do.

kalu · on Feb 13, 2024

So this author trained a neural network billions of times using different hyper parameters ? How much dod that cost ?

tpm · on Feb 13, 2024

"overnight on an A100" per https://twitter.com/jaschasd/status/1757056127991439832

TheCoreh · on Feb 13, 2024

The networks trained were really small, with only one hidden layer, and a width of 16.

mkl · on Feb 13, 2024

Very small networks are cheap and easy (even 20 years ago).

milliams · on Feb 13, 2024

I'd argue that these are not fractals in the mathematical sense, but they do seem to be demonstrating chaos.

bob88jg · on Feb 13, 2024

This is what I Initially thought too but I am less certain now - I assumed Fractal implied self similarity (which this doesn't seem to have) but this is in fact not true - I think to actually say if its fractal or not someone needs to estimate its dimension using box counting or some, way beyond me, analytical method.

wang_li · on Feb 13, 2024

They should have definition at arbitrary scales. These images do not, they are built from a fixed sized matrix. Beyond a certain point you are between the data points.

bob88jg · on Feb 13, 2024

Same can be said for the madelbtot set or any other fractal - there is a physical bound on the precision of the values - you can in theory run a NN with arbitary precision floats...

wang_li · on Feb 14, 2024

But the image is based on the attributes of each node in the NN matrix. The Mandelbrot set can be calculated at any precision you desire and it keeps going. An NN is discrete and finite.

bob88jg · on Feb 15, 2024

That's not what it shows at all - it shows how varying hyper parameters (which are floats and thus can be set at any arbitary precision) effects the speed at which convergence happens - so its some function F: R^n -> Z - it has literally nothing to do with the nodes in the neural network...

karxxm · on Feb 13, 2024

What’s that color-map called?

nossid · on Feb 13, 2024

In the notebook you can see it set to Spectral.

https://github.com/Sohl-Dickstein/fractal/blob/main/the_boun...

albertgt · on Feb 13, 2024

Dave Bowman> omg it’s full of fractals

HAL> why yes Dave what did you think I was made of

7e · on Feb 13, 2024

Today I learned that if something is detailed, it is now fractal.

lifthrasiir · on Feb 13, 2024

Fractal is not necessarily self-similar; it just has to show enough detailed structures when zoomed in ad infinitum. While no single defiition for fractals exists, at least that's one of the common denominators (because otherwise many "fractals" in the nature can't be called so).

magicalhippo · on Feb 13, 2024

It would be interesting to compute the fractal dimension[1] to see if it really is fractal[2] or just looks like it.

I recall similar tests being done on paintings by Pollock and of other artists trying to copy his style to determine authenticity[3].

[1]: https://en.wikipedia.org/wiki/Fractal_dimension

[2]: https://mathworld.wolfram.com/Fractal.html

[3]: https://cpb-us-e1.wpmucdn.com/blogs.uoregon.edu/dist/e/12535...

bluetintedfort · on Feb 13, 2024

The sister paper[1] contains the fractal dimensions of some of the images using a boxcount paper.

[1]: https://arxiv.org/pdf/2402.06184.pdf

bob88jg · on Feb 13, 2024

Thanks!! I was also under the impression Fractals required self similarity and not just infinite detail but glad to have that misconception corrected!

I think it probably stems from the fact that all the fractals, I have seen, for which the dimension can be analytically calculated do show obvious patterns of similarity at different scales.

magicalhippo · on Feb 13, 2024

Searching I stumbled over this paper[1], which I found interesting, on characterizing the fractal-ness of geological structures.

They point out that the usual measure of fractal dimension, or capacity dimension[2], doesn't consider the physical size of the features and can thus be inaccurate. Instead they suggest using the information dimension[3], which is bounded by the capacity dimension.

[1]: https://doi.org/10.1016/j.chaos.2018.05.008 (full text available on that hub of science)

[2]: https://mathworld.wolfram.com/CapacityDimension.html

[3]: https://mathworld.wolfram.com/InformationDimension.html

tudorw · on Feb 18, 2024

Seems there are a few more, Higuchi Dimension, Box Counting Dimension (D0), Generalized Rényi Dimension and others are discussed mid-way through this;

https://medium.com/@h.a.papageorgiou/the-reality-of-the-ruli...

magicalhippo · on Feb 13, 2024

Ah I had a quick read before I went to bed last night, missed that bit, cheers.

paulrudy · on Feb 13, 2024

I'm only a fractal enthusiast, but my impression is that the key distinction that makes these fractals, or at least fractal-like, is not their detail per se, but that there is complexity at every scale. From the article:

> we find intricate structure at every scale

> At every length scale, small changes in the hyperparameters can lead to large changes in training dynamics

bob88jg · on Feb 13, 2024

"small changes in the hyperparameters can lead to large changes in training dynamics"

This is the definition of Chaos though no? Butterfly flaps its wings, hurricane on other side of the planet...

paulrudy · on Feb 15, 2024

As far as I am aware, these kinds of nonlinear relationships are a feature of fractal dynamics, but I'm not a mathematician.