A Theory on Adam Instability in Large-Scale Machine Learning (arxiv.org)
140 points by vov_or on July 18, 2023 | 51 comments



"The name Adam is derived from adaptive moment estimation.", https://arxiv.org/abs/1412.6980


This is not a coincidence, because nothing is ever a coincidence.


Is it possible to use derivative-free/black-box optimizers to train these large networks?

From what I understand, gradient descent and its cousins can't suddenly jump to a distant global optimum.


I have actually attempted this recently. I took the small 10M-parameter Shakespeare language model used as an example in nanoGPT, swapped out gradient descent, and tested various black-box optimizers I could find in the literature.

It takes 3 minutes to train the Shakespeare model with gradient descent. The black-box methods I've tested so far would likely take 30+ hours to train (I haven't tried to take them to the end yet). I've hit a wall where progress is very slow. The text generated at that stage has punctuation, and words are separated by spaces, but the words themselves are mostly nonsense. It almost feels like the model learned that English is letters separated by spaces, and that you put exclamation marks or periods at the end, but not much more.

There are some larger-scale CMA-ES variants I still want to test that don't have quality implementations. I've stared at pictures of gradients and weights from half-trained models, trying to come up with ideas for how to get there with black-box optimization. I'm also trying some original ideas where you compute a gradient, but not against a loss function; the gradient would be more for discovering hidden structure in the weights, which you would then give to a black-box optimizer as a guide (which I guess makes it not entirely black box. Gray box?).

Possible? I mean, I guess technically. Practical? No way, unless some major breakthrough happens.

My current goal is to just produce a model, even if training takes laughably long, so I can say I've trained a language model using nothing but a fitness score from a black-box function.
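
To give a flavor of what I mean by "nothing but a fitness score", here is a rough numpy sketch of a simple evolution-strategies-style loop. get_flat_params, set_flat_params, loss_on_batch, and data_iter are placeholder helpers for this sketch, not nanoGPT's actual API:

    # Sketch of an OpenAI-style evolution strategy driven only by a scalar loss.
    # get_flat_params / set_flat_params / loss_on_batch / data_iter are placeholders.
    import numpy as np

    def train_black_box(model, data_iter, steps=10000, pop=32, sigma=0.02, lr=0.01):
        theta = get_flat_params(model)                # flattened weight vector
        for _ in range(steps):
            batch = next(data_iter)
            noise = np.random.randn(pop, theta.size)  # one perturbation per "particle"
            losses = np.empty(pop)
            for i in range(pop):
                set_flat_params(model, theta + sigma * noise[i])
                losses[i] = loss_on_batch(model, batch)   # the only signal we get
            # Rank-normalize the losses and step against the estimated direction
            # of increasing loss.
            ranks = np.argsort(np.argsort(losses)) / (pop - 1) - 0.5
            theta -= lr / (pop * sigma) * (noise.T @ ranks)
        set_flat_params(model, theta)
        return model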

Edit: if you are reading this and are aware of any other serious attempts at training a non-trivially-sized language model without gradient descent, I would like to know, so I can try their methods. I know there's some large-scale work in reinforcement learning, like in one Uber paper, but not for LLMs specifically.


Not really surprising-- CMAES basically replaces the actual gradient you care about with a rough numerical approximation to it that's based on looking at lots of input-output pairs. I think the concept originated in surveying, where it's called the technique of "kriging":

https://www.publichealth.columbia.edu/research/population-he...

https://en.wikipedia.org/wiki/Kriging

Basically, you are wasting most of your compute to come up with a rough local approximation to the thing you actually want. But that's sort of pointless in the NN training context, because what you want is basically the gradient (and maybe some higher order terms that tell you about the local curvature too).

CMAES makes sense when the gradient is not even well defined. For example, if you have a bunch of parameters for an airplane design, and then want to take that design and do a bunch of huge aerodynamics calculations to compute its lift, or do a big finite element analysis to measure how well it withstands various stresses, and at the end of that big analysis, you get back a number, like "maximum lift" or something. If each run takes hours on a supercomputer, then you clearly don't have anything close to a gradient and it would be very expensive to even try to approximate it numerically. So CMAES is useful there in helping you pick better high level parameters in a smart way-- basically it's a big improvement over grid search.


I think I saw a paper that argued that CMA-ES approximates the natural gradient, which is not the same gradient you see in typical NN training. Or at least that's how I understood it. (I have no background in data science or ML; I'm just a bored engineer.)

I haven't estimated the number of trials you would need for the 10M Shakespeare model, but to get to the same level as gradient descent, I think it might be around 10M, i.e. the same ballpark as the number of parameters. Which makes some intuitive sense given how little you learn from each black-box function evaluation.

There's maybe some hope that there is hidden structure in the problem that doesn't actually need anywhere close to 10M parameters, so that a black-box optimizer might still find it. I don't have my hopes up, but I'm trying to poke at it.

I would think that if it turns out LLMs are not totally impossible with black-box optimization, then it would be good to find a reason to use it: e.g. objective functions that don't have a hope of having a good gradient, some kind of weird architecture that can't be trained conventionally, maybe fine-tuning gradient-descent-optimized models with those things, etc. Feels like a solution looking for a problem.

I'm doing my project for fun and just seeing if it's possible at all.


Have you tried Hinton's forward-forward method?


I have not. I have read the paper though. I do want to try it and likely will at some point.

Next up after this project, I want to test some meta-learning ideas. I read some papers where the idea is that all weights are actually tiny neural networks, all sharing the same parameters, which you train to learn backpropagation (or whatever learning algorithm it converges to). The paper I read this in argued it also worked forward-only, but my intuition doesn't quite grasp how. I want to follow up a bit on this line of research, check whether there have been any new developments since I read them, and try them out in my own code.


Very interesting, thanks for sharing. I would be interested in reading more on gradient-free optimization applied to large problems (like LLMs).


No one has the scale to make that happen.

It's about information. Gradient-free methods integrate little or no information about the problem; they're a blind watchmaker. This works, but it's slow, and it gets slower the bigger your problem is (the curse of dimensionality).

Gradients integrate some limited information about the problem. This lets you find solutions much faster, and neural networks are structured specifically to be easy to optimize with gradients. Local minima don't seem to be a problem.

The future is probably even smarter optimizers that integrate more information about the problem and learn to make good assumptions. This is the goal of Learned Optimizers, like Velo (https://arxiv.org/abs/2211.09760).


Thank you for posting Velo optimizer - very interesting work.


You could start by reading about CMA-ES, which is something like a particle filter on the model parameters. With 100 "particles", you have 100 resampled copies of the model, which are then evaluated to create something like a "synthetic" gradient that is used to update a distribution over the model parameters.

But it doesn't solve the problem of local minima, and it will also need to use minibatches.
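
A rough sketch of that ask/evaluate/tell loop using the pycma package; evaluate_loss here is a placeholder minibatch-loss fitness function, and the toy dimensionality hides that a full covariance matrix is hopeless at LLM scale:

    # Ask/evaluate/tell loop with pycma (pip install cma).
    # evaluate_loss is a hypothetical fitness function over a flat weight vector.
    import cma
    import numpy as np

    x0 = np.zeros(1000)        # flattened model parameters (toy size, not 10M)
    es = cma.CMAEvolutionStrategy(x0, 0.1, {'popsize': 100})

    while not es.stop():
        candidates = es.ask()                              # 100 resampled copies of the weights
        fitnesses = [evaluate_loss(x) for x in candidates] # e.g. minibatch loss
        es.tell(candidates, fitnesses)                     # update the search distribution
        es.disp()

    best_weights = es.result.xbest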


People have tried, but all the local minima perform about the same, so there's no point trying to find a global minimum. A much better strategy is to train multiple models and use all of them at inference time to score better.


You have a gradient so use it instead of faffing about. As another user said, the optima are all the same since the models are wildly over-parameterized.


This is tersely stated, but it's wise. In general, following the gradient (if you have it) is a very, very good idea.


First-order optimizers are supposed to have trouble because they fall into local minima; in practice, however, things are different.

When your parameter space is on the order of billions of dimensions, for all practical purposes there is always a direction of descent.

Moreover, local minima seem to be rather close to the global minimum.


I believe there are still ongoing efforts in this direction. See, for example, the forward-forward algorithm.

https://www.google.com/url?sa=t&source=web&rct=j&opi=8997844...


Particle swarm optimization, genetic algorithms, and tabu search/heuristic search are some methods I'm aware of for forcing a search out of a local optimum. Using Halton sequences can also help cover the space for search initialization, versus simple random draws.
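
As a small illustration of the Halton point, scipy's quasi-Monte Carlo module can generate the low-discrepancy starting points; the dimension and bounds below are arbitrary placeholders:

    # Halton vs. plain random starting points for a multi-start search.
    import numpy as np
    from scipy.stats import qmc

    dim, n_starts = 8, 64
    halton = qmc.Halton(d=dim, scramble=True, seed=0)
    unit_points = halton.random(n=n_starts)        # low-discrepancy points in [0, 1)^dim
    starts = qmc.scale(unit_points, l_bounds=[-1.0] * dim, u_bounds=[1.0] * dim)

    random_starts = np.random.default_rng(0).uniform(-1.0, 1.0, size=(n_starts, dim))
    # The Halton points cover the box more evenly than the plain random draws.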


From my admittedly limited understanding, Adam is basically gradient descent combined with heuristic search.


Adam is somewhat analogous to an audio compressor on the gradient "signals".

(Edit: eh sort of ...)


Yes, and the only reason it doesn't work is that no one has written truly fast GPU implementations of them. Don't let anyone here teach you otherwise; even small-scale, crappy versions (like what I could code in numpy) can successfully solve reinforcement learning problems rather quickly. Naysayers might tell you that it doesn't work, but they are wrong. Global optimization is strictly superior to local optimization in general, and we in the AI field are stuck deep in a local minimum right now.

Here's me implementing an algorithm from 2009 in single-core on a CPU and getting pretty excellent results on RLHF benchmarks: https://github.com/Hellisotherpeople/Python-Cooperative-Syna...


The whole argument is that at large scales with billions of params it doesn't matter, specifically because of those billions, so giving a toy example seems to miss the point.


No.


More correctly: we don't know how to do this efficiently. Biological neural networks don't use backpropagation and work great.


Convergence takes years, though... It's entirely possible that backprop is biologically unrealistic but far more efficient than biological learning.


Out of my depth so happy to be corrected.

Don’t many/most state of the art models take many months to train on far more data than humans need for similar tasks?

Also, while e.g. GPT4 is quite capable across many tasks - humans seem to average towards learning robust _learning techniques_ themselves. Learning a new subject becomes easier thanks to somehow tracking and encoding learning strategies that are robust to learning other unrelated topics.


> Don’t many/most state of the art models take many months to train on far more data than humans need for similar tasks?

Humans generally need 18 years of pre training followed by 4-6 years of fine tuning before they can “one-shot” many difficult tasks. That’s way more training than any machine learning model I’m aware of.

Even for tasks like reading the newspaper and summarizing what you read, you probably had to train for 10-12 years.


I see this stance of yours parroted over and over, but a 3-year-old can tell a dog from a cat and doesn't need to be trained on millions of images. It also uses way less energy to do so.


It takes basically a week on a single GPU to train AlexNet, which has human-level ImageNet performance. Let's say it's 500 W for the GPU versus around 10 W for a human brain. So that's 84 kWh for the model and 175 kWh for the baby (over 3 years at 16 h/day). That's without the half billion years of architecture and initialization tuning that the baby has. I think the model performs very favorably.
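
The arithmetic, for anyone who wants to check it (using the rough power figures above):

    # Back-of-the-envelope energy comparison using the figures above.
    gpu_watts, train_hours = 500, 7 * 24           # one GPU for about a week
    model_kwh = gpu_watts * train_hours / 1000     # 84 kWh

    brain_watts, waking_hours = 10, 3 * 365 * 16   # three years at 16 h/day
    baby_kwh = brain_watts * waking_hours / 1000   # ~175 kWh

    print(model_kwh, baby_kwh)                     # 84.0 175.2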


I don't. This is so obscenely flawed in obvious ways. The energy to train the model was used only for model training, while the energy used by the baby went to a myriad of tasks including image recognition, and the baby can presumably apply the knowledge gained in novel ways. Not only can a baby identify a cat and a dog, but it can also say what the difference is in audible language, fire neurons to operate its musculoskeletal system (albeit poorly), and perhaps even no longer shit its pants. Apples and oranges. Is model performance getting more impressive every day? Definitely. Has anyone actually demonstrated "AI"? Still nope.


The context of this thread is the cost of training brains and models on comparable tasks. Not that the model is comparable to a human in every way.

If you want to be pedantic, then 6% of the human brain is the visual cortex, but then you also have to argue that AlexNet is horribly inefficient to train. So you cut the brain cost to 6% and the model cost to 1%. They're still within an order of magnitude (favoring the model), which I'd say is pretty close in terms of energy usage.


That context is more narrow than I suggested.


But don’t 3 yr olds have more skills than distinguishing dogs from cats?


Sure, but my point is that the energy costs are of the same order of magnitude.

If you want to be pedantic, then only 6% of the human brain is the visual cortex. But AlexNet is also an inefficient model, so something like an optimized ResNet is 100x as efficient to train. So now you're at 10.5 kWh and 1.5 kWh for the baby and the model respectively.

You can argue details further but I'd say the energy cost of both is fairly close.


You’re missing my original point which is about continued, ongoing robustness that works in the low data regime and allows pilots/astronauts to make reasonable decisions in _completely novel_ situations (as just one example).

The networks we have are trained once and work efficiently for their training dataset. They are even robust to outliers in the distribution of that dataset. But they aren’t robust to surprises, unmentioned changes in assumptions/rules/patterns about the data.

Even reinforcement learning is still struggling with this as self-play effectively requires being able to run your dataset/simulation quickly enough to find new policies that work. Humans don’t even have time for that, much less access to a full running simulation of their environments. Although we do generate world models and there’s some work towards that I believe.

Again happy to be corrected.


Not necessarily! Feel free to gather some data on 3 year old skills on imagenet... I'll wait.


> but a 3 year old can tell a dog from a cat doesn't need to be trained on millions of images

A 3-year-old has 3 years of multimodal training data and RLHF, plus a few billion years of evolution that have primed and biased our visual and cognitive systems.

That's a lot more data than machine models, which start with literally zero inherent bias. Assuming you want a true apples-to-apples comparison.


I can train a birdsong recognition model in about two days on a V100 that performs decently well on upwards of three thousand species, and generalizes reasonably to real-world data (beyond a somewhat skewed training data distribution).

Humans are very bad at this task; it takes a massive effort to learn this many birds. In fact it's a great counterexample to human few shot learning ability...


This is under the assumption that brains start at random (the Tabula Rasa theory of the brain) but that doesn’t seem plausible to me. Brains have the benefit of some amount of pre training at the time of birth. That’s why spiders don’t need to be taught how to spin complex webs and why humans don’t need to learn how to manipulate abstract mental symbols (i.e. language).


A 3 year old still takes 3 years to train. Even a state of the art image recognition model takes way less time than that.


The moronic thing about this is, humans aren't just training, they're having a life. The "training part" is one part of it sure, but it's not the reason for existence.


What's the functional difference between training and having a life?


Not sure if you meant this as a joke, but it made me smile. That poor child!


The structure of that 3yr old’s visual cortex is itself the result of a ~500 million year optimization process.


When gradients become auto-correlated, they are not zero-mean. In that case, vₜ can grow large and uₜ becomes very small. Maybe try centering the g² term in vₜ, like (gₜ-mₜ)². Also, when restarting training, m and v are probably reset to zero.

Might not be related to the phenomena in the paper.
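
For reference, a rough sketch of the plain Adam moment updates and the centered second-moment variant suggested above (this is an illustration of the suggestion, not something from the paper):

    # Plain Adam moment updates plus the "centered" second-moment variant
    # suggested above (subtracting the first-moment estimate before squaring).
    import numpy as np

    def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, centered=False):
        m = b1 * m + (1 - b1) * g                      # first moment m_t
        sq = (g - m) ** 2 if centered else g ** 2      # centered variant uses (g_t - m_t)^2
        v = b2 * v + (1 - b2) * sq                     # second moment v_t
        m_hat = m / (1 - b1 ** t)                      # bias correction (t starts at 1)
        v_hat = v / (1 - b2 ** t)
        u = lr * m_hat / (np.sqrt(v_hat) + eps)        # the update u_t
        return theta - u, m, v                         # m, v go back to zero on a cold restart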


So it happens that when training an LLM, some training batches worsen the model, but the same batches actually improve it when fed later. Why?


It makes intuitive sense.

If I speak Japanese at you (and you are a non-Japanese speaker), I will just confuse you. But if you instead spend two years learning Japanese and then I share some information with you in Japanese, you will learn something new and become more knowledgeable.


That is a good analogy. The insight is improved by realising that in the human context the confusion is temporary and results in the rejection of the data. In the LLM it is forced into the matrix in the incorrect context, so it is harmful.


"In this work, we argue that the training loss instabilities observed in large-scale training should be associated with the time-domain correlation between the gradient estimates of earlier layers in the deep-learning models. Based on the identified connection, we propose several ways to mitigate the instabilities, along with the heuristic method that was known in the literature. We conclude that at this point, there is no silver bullet to solve the problem, and the appropriate remedy depends on the specific setup of the large-scale training run."


So, it's a form of superstition?


This may be naive, but the gradient seen during one training batch would not depend only on the content of that batch, but also on the outcome of all previous batches (or so I suppose). If that is so, then whether one of these spikes occurs is not only a function of the batch, but also of the sequence of prior batches.



