The Mathematics of Training LLMs (latent.space)
199 points by FanaHOVA on Aug 16, 2023 | 66 comments



I think this is a lot of the mathematics of scaling LLM training. Which is quite important!

One fundamental requirement though for any machine learning engineer working on these kinds of systems is https://people.math.harvard.edu/~ctm/home/text/others/shanno.... I do not want to be entirely hypocritical as I am still ingesting this theory myself (started several years ago!), but I've found it _absolutely crucial_ in working in ML, as it implicitly informs every single decision you make when designing, deploying, and scaling neural networks.

Without it, I feel the field turns into an empirical "stabby stab-around in the dark" kind of game, which very much has its dopamine highs and lows, but, à la Sutton, does not scale very well in the long run. ;P


There's something I like to tell people: "You don't need math to make a good model, but you need math to know why your model is wrong." All models are wrong, and actually one of the crucial points we're at is not enough concentration on how wrong our models are (or how wrong our evaluations are, which is also a type of model). I suggest also investing time in statistics, understanding higher dimensional spaces (topologically and statistically), and metric theory.

It is always a "stab-around in the dark" unfortunately, but the deeper your mathematical understanding is the brighter your candle burns to help you poke around. I think a lack of mathematical understanding has also made people misread Sutton as endorsing "scale is all you need" rather than "flexibility combined with scale has historically resulted in the largest gains." These things are very different.


This is well put.

Mathematics is our candle in the darkness of our understanding, and the more math we know the brighter our candle flames.

But Mathematics is not a candle, it is the purest analogy we have ever encountered, and it enlightens our minds.

It gives us an appreciation of scale, and measurement thereof, that is unmatched, yielding deep insights into itself and everything else.

I just wish everyone learned number theory and geometry and algebra and physics and and and...

I just wish everyone would read the Road to Reality by Penrose.

Models are little stepping stones that are slippy and unsure, but we must use them to take further steps into the darkness of our ignorance.


> I just wish everyone learned number theory and geometry and algebra and physics and and and...

I'm less concerned with everyone having knowledge as much as I am concerned about everyone acting like an explorer and finding joy in the exploration simply for exploration's sake. Having targets and goals is good, and we probably shouldn't always wander aimlessly (though this shouldn't be discouraged!), but concentrating too much on the target takes away the joy and makes us miss many beautiful things. I cannot find a stone that was turned which didn't end up providing utility, even if it didn't at the time. Competition can make the game more fun, but too much ruins it. It makes us lose sight that it is a game in the first place and why we play. We enrich the lives of ourselves and others by playing it, but we could also have full lives were we all to stop. I think we sometimes forget why we're playing and just do it because others are or because of personal gains. It's like trying to pick winners in a cooperative, never-ending game. Reminds me of my friend who would jokingly ask "did you win?" when I mentioned a recent game of DnD. The fierce competition also discourages others from playing, and I just wish I could show them how beautiful and awesome this game is. To not let some assholes ruin our fun.

I think currently we have lost sight of why we play the game and are too caught up in figuring out who the best player is. In reality such a thing is non-determinable due to the system complexities and timescales. I also don't really see any good utility other than an ego boost, and it comes at the cost of hyper-focusing on the temporally near outlook rather than exploring further. We need many types of explorers and I'm not quite understanding why we've structured the system to reward only one type. Simplicity?


Let's play D&D

And let's use our own intelligence to do so

Intelligence is practiced thought, wisdom is practiced intelligence, love is practiced wisdom


Can I roll an acrobatics check to do a sick backflip and try to impress the kobold?


The kobold offers you a job for his son's bat mitzvah in the coming months.


> I just wish everyone learned number theory and geometry and algebra and physics and and and...

When people ask why do CS Majors learn so much maths? "They won't be using it for programming!"

Maths teaches the CS major how to think. Sometimes things are not A->B and are very abstract like math problems.


> When people ask why do CS Majors learn so much maths?

CS majors learn a lot of math?


We also had math classes masquerading as CS courses.

For example, Algorithms with Big O notation analysis


Oh gosh, yes. I only had to 'upgrade' one class to an effectively harder version to get a minor in math. :'||||


I mean I'm biased since I did a physics undergrad _and_ took additional math classes, but from what I'm aware of, even top universities require less math from CS majors than they do for even the easiest Engineering degrees. Calc II (series, not multi-variate) is not "high level" in a modern context (maybe in a general social context), and Calc II plus Linear Algebra is nowhere near "a lot" unless you compare to majors who do not even take calculus.

So pretty much: when comparing to other STEM majors, I'm given the impression that CS majors are near the bottom of math requirements. At both undergrad and graduate levels.


I went up through Calc III, and of course PHYS I and II w/ calc is a barebones requirement. Discrete, linear, stats, the basics there.

I think it varies per college. Calc II was the very first class I took on the very first day of my freshman year of college. It was really a decent bit of work but not unbearable.

Most CS programs are pretty math-intensive and this seems to be the general community consensus, I'm not sure where you're getting this impression from.


I mean specifically in the context of STEM degrees. I was curious about my intuition so I did a quick look. Used Stanford as a baseline and their sample plans if available. CS looks on par with many but slightly south of the median since several engineering degrees require more math. The ones that require less are "Management Science & Engineering" and "Materials Science and Engineering". But typically similar to chem, bio, and most engineering degrees, which all have more science courses which can teach additional mathematics (generally in calc and stats). Being close to the median I wouldn't call this "a lot," even if there is low variance in that distribution. Where I'd probably draw that line is beyond the general 3 calcs and linear algebra. Personally I'm generally unimpressed with most intro stats courses.

Here's at least some flow charts for engineering degrees (which has a CS program): https://ughb.stanford.edu/plans-program-sheets/program-sheet...

Additional note: I can at least attest that in my university Calc 3 (multivariate) is not a requirement, and this does not appear to be out of the norm. Interesting side note: same with ethics, which is exceptionally common in other STEM programs (I believe a federal requirement in engineering).


We were given the option for two more classes to get a Math minor

I was a senior and wanted to just graduate, so I skipped it


Ha! I think we may be on the same page with Sutton, that and the misuse of the NFL theorem are the two disclaimers I put out the most. My most recent Sutton one was ~3 hours ago! (https://news.ycombinator.com/item?id=37129921#37151192).

That's a really good point, and I stand corrected. I guess it is still very much stabbing around in the dark, just with some more information about the probability density function of good answers. Heck, even Terry Tao's post about how to solve problems shows very much a (refined) guess-and-check method, so I have quite little ground to stand on there.

Metric theory is fun, and one I'd like to learn a lot more about. I certainly have a lot to learn there.


Yeah I think the nature of research is de facto searching in the dark. I mean if it weren't, would it really be research? I think of the analogy as the knowledge space is a dark void and our collective knowledge is permanent torches lighting up the area. Researchers are exploring into the darkness and finding suitable places to place the next torch. Maybe we can say that the torch's brightness is defined by how well known your work is, since people tend to the flames more (not to be confused with how important the work is).

The big question to me is about the geometry of that knowledge space. Is it bounded? If bounded (most likely imo), how do we approach the bound? Do we plateau once we hit it (unlikely imo)? Do we approach it {,sub,super}-linearly? Do we approach it {,sub,super}-linearly and then transition into an asymptotic convergence (e.g. an S-curve)? This question actually has a profound impact on the question of the risk of a super-intelligence. IMO knowledge is most likely bounded and most likely looks like an S-curve. If we're more than halfway through that S-curve, then a super intelligence that is 100x smarter than us may only know <100x as much as us. We have some reason to believe this is the case, since knowledge appears to compound and accelerate (linear and sublinear models don't fit our historic trend, but that doesn't tell us the shape of the curve ahead). That reduces the risk of a super intelligence absolutely outsmarting us, since it would be easier to close the gap, especially since learning through observation is easier than learning through exploration (we can use its torches, even if they are more difficult to see). I don't see this really discussed much in the super intelligence conversations, and we're often working with different priors, which creates fundamental disagreements that become impassable. Maybe we need to settle these priors first before we even discuss the next priors about the motivation of a super intelligence.

At least for math, I don't think I have the capacity to learn it all in the short human life, even if we're just discussing what math would be helpful to ML. But it sure is fun to learn it, so I guess the inability to learn it all is not really a downside :) Now if only I can convince others that math is fun __and__ useful to ML.


Please list a single decision you've made that was directly influenced by Shannon's paper, and no I do not mean something you post hoc rationalized.


Sure. Basically everything in https://github.com/tysam-code/hlb-CIFAR10 was directly founded on concepts shared in the above paper, down to the coding, commenting, and layout styles (hence why I advocate so strongly for it as a requirement for ML. The empirical benefits are clear to me).

Before I sat down and wrote my first line, I spent a very long time thinking about how to optimize the repo. Not just in terms of information flow during training, but how the code was laid out (minimize the expected value of deltas for changes from a superset of possible code changes), to even the explanatory comments (ratio of space vs mental effort to decode the repo for experienced vs inexperienced developers). I really want it to be a good exemplary model of a different, more scalable, and more efficient way of conducting small-scale (and potentially resource-constrained) research. To do that, you have to maximize information efficiency at every stage of the pipeline, including temporally (!!!!).

It's not perfect, but I've used info theory as a strong guiding light for that repo. There's more to say here, but it's a long conversation about the expected utility of doing research a few different kinds of ways.


I am new to all this but I suspect it is good to at least know why cross entropy loss is a good measure.


Just a point of clarification -- are you looking to know why it is, or are you pointing out that it's important? I'm suspecting the latter but am happy to share some resources if it's the former.

(Also, the poster that you're replying to has a bit of a history of...stirring the pot in comment sections if you look. But hey, it let me show some of my open source work, so that's good, at least! ;PPPP)


I think it is useful to know what works well for foundational reasons, so you know what to tweak and what is less likely to work.

I would love to know more about why. I have hundreds of why questions about transformers! And deep learning in general. The main one being: why are we so lucky that gradient descent is stable and works?! Feels like a miracle and the opposite of the trickiness thrown at us by physics, maths and computer science!

I mean there are plenty of reasons I intuitively (just gut feel without mathematical intuition) think gradient descent should break down on big models. For example it could just become chaotic, nudging any parameter has a butterfly effect and the loss bounces around randomly and doesn’t converge.


Feel free to shoot, I can do the best that I can.

W.r.t. the weight updates, ah, but it is extremely chaotic, your intuition is correct I feel! :) :D It took me a lot of years to figure that out though, so good job on that one. We've just adapted our architectures to clamp down on those extra degrees of freedom and break that chained dependency from layer to layer. For example, how residuals softly 'force' a single, defined latent space for each group of layers that adds into a residual, how batchnorm tends to re-center and scale the data to prevent a statistical explosion -- decoupling it from learned scale and bias. How label smoothing basically can be viewed as an interpolation between raw cross entropy and predicting a uniform distribution every time -- what do you think that does to the internal sensitivity of the network? It's a great, label-free way to explicitly train for stability. If you look at the expected value of the magnitudes of cross entropy for output distributions of different entropies, you'll quickly see why an entropy-regulated loss greatly slows down the popcorn-bag of changing kernels (heh) to a dull mumble. Heavy ball momentum prevents a lot of oscillation because it smooths out our gradient updates over time -- surprisingly effective! And of course LR schedules like OneCycle that do a kind of LR annealing are very effective as well.

There's many, many, many more little things that go into that, and I think it took a few decades for us to find a lot of those things! We could go back 20 years with the ideological innovations that we've found and even on the constrained compute would be leagues ahead of what was available at the time, as a result.

Another cool effect is how the Lottery Ticket Hypothesis shows that with rolling enough 'tickets', one can find weights suitably close to a decent representation from the start -- and much of the training is just tuning, and potentially suppressing the noise from more poor initializations. That still somewhat goes over my head but it is a cool hypothesis indeed.

I'm not sure if that's a good teaser response or not, but I'm happy to talk at length about certain things. :)))) <3 <3 <3 :D


Thanks! That sounds like years of learning and experience distilled into a few paragraphs. Scary and interesting at the same time. Scary because learning math intensive stuff is 10x slower than coding stuff. I may never understand all that. Right now I am taking a break from learning deep and trying to have some fun, hence I am doing the fastAI course and hope to categorise dog and cat pics etc, and find a nice use case to impress the kids. But I will probably swing back round to this stuff at some point as it is fun (and unfun at the same time) to learn.


Yeah, they are great and some of the reason (up the causal chain) for some of the work I've done! Seems really fun! <3 :))))

Facebook's Segment Anything Model I think has a lot of potentially really fun usecases. Plaintext description -> Network segmentation (https://github.com/facebookresearch/segment-anything/blob/ma...) Not sure if that's what you're looking for or not, but I love that impressing your kids is where your heart is. That kind of parenting makes me very, very, very, happy. :') <3


Good challenge, I hope we will get a response.


Your wish hath been granted! <3 :))))


Do you mean that information theory in general is essential in working with ML systems, or a specific point raised by Shannon?


I'm sure there's a specific point raised by Shannon, and I've been (very recently!) learning how Shannon himself is not the sole info theory dude (even though lots of people credit him for it!).

But basically, the communication of information over a noisy channel _is_ the foundation for deep learning.


As they say, intelligence is prediction is compression is encoding...


Well, I agree that encoding is compression, at least, but the rest of that statement I do disagree with! It seems to be one of the more common mantras going around right now, though. Which is partially why I advocate for Shannon's theory! It's very clean and bare metal in terms of being related to intelligence, though I think one could build a decent argument for intelligence from info theory.


> Well, I agree that encoding is compression, at least, but the rest of that statement I do disagree with!

IMHO the Shannon paper you linked is effectively the initial work linking prediction with compression, showing how the information transmitted (and the amount of information that needs to be transmitted) decreases as you can more accurately predict the likelihoods of the next symbol.
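
A toy illustration of that link (my own made-up numbers, not Shannon's worked example): the expected code length for a symbol stream is the cross-entropy, in bits, between the true distribution and your predictive model, so better prediction directly means fewer bits.

    import math

    # Expected bits per symbol when coding a source with true distribution p
    # using a code built from the model q (cross-entropy in bits).
    def cross_entropy_bits(p, q):
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p         = [0.7, 0.2, 0.1]      # true symbol frequencies
    q_good    = [0.6, 0.3, 0.1]      # a decent predictive model
    q_uniform = [1/3, 1/3, 1/3]      # no prediction at all

    print(cross_entropy_bits(p, p))          # entropy, the floor: ~1.16 bits/symbol
    print(cross_entropy_bits(p, q_good))     # a little above the floor
    print(cross_entropy_bits(p, q_uniform))  # log2(3) ~ 1.58 bits/symbol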


Yes, but prediction != compression, and intelligence != prediction! However, prediction can inform compression, but there's necessarily some framework needed to translate the information from prediction<->compression. Perhaps that is too nitpicky, but to me it's like thinking about raw materials (like raw food or whatnot) and how it ends up as a finished product (like cornbread. Corn farming is not cornbread, but there is a semi-clear family of paths from one to the other with its own set of optimizations, rules, and etc).

Again, that could be a bit nitpicky on my end depending on how one is viewing it.


It probably is from the perspective of an information theorist. Did you read any interesting articles on the connections between deep learning and information theory to come to this conclusion? I'm highly interested in this space, but the influence of information theory on deep learning developments appears to be negligible.


Aha! Yes, yes, yes! Good question. A senior scientist handed me a book form of this paper (plus the introduction by Weaver, which made things much easier for little, younger me!) From there, it's just been the process of getting to know the field better and making the connections over time. It's been about 7-8 years for me, so I've been deep into neural networks for a little while.

Not everyone uses information theory to inform research into neural networks, which is a darn shame, but many of the recent advances I feel can be trivially explained with info theory, or at least derived from it.

I can give a few basic examples about where it interacts with deep learning. One obvious one is the cross-entropy function. This of course is trying to perform ERM using the MLE over the dataset, where the neural network is the mapping function f(x) -> y. For a loss function, the cross-entropy is ideal in this case, as we want to choose a coding scheme that minimizes our regret on new data based on the statistics of some data we have (thus the 'E' in Empirical Risk Minimization, and the 'R', and the, oh gosh darn it, you get what I mean here). This is analogous to a communications process where the neural network f is emitting a target token with a likelihood determined by f(x). In this case, the softmax serves as that emission probability, and instead of optimizing a discrete system, we are optimizing a continuous relaxation of the probabilities of that discrete system in the limit.
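
As a rough single-example sketch of that framing (toy numbers, imaginary 3-token vocabulary): the logits f(x) define the emission distribution via softmax, and the per-example cross-entropy is just the negative log-likelihood the model assigns to the token that actually occurred.

    import numpy as np

    def softmax(z):
        z = z - z.max()              # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([2.0, 0.5, -1.0])   # f(x) over a 3-token vocabulary
    target = 0                            # the token that actually occurred

    probs = softmax(logits)
    nll = -np.log(probs[target])          # per-example cross-entropy / negative log-likelihood
    print(probs, nll)

Averaging that quantity over the dataset is the empirical risk being minimized.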

You can go a lot deeper into the rabbit hole and rephrase basically everything in ML in that light. Doing so has rather significantly helped me push the bounds of performance in my area of research, at least. There is a lot to learn, however.

So I'm sure there are decent articles out there, but second-hand stuff will likely only be useful for getting a general gist of a problem, at the cost of also inheriting the common public model for a way of thinking about things. If you go off of the beaten path and try to translate the whole training process (don't forget the temporal aspect!) into this framework, I think you'll find it very challenging and interesting. Very fun! <3 :) Hope that answers the question, if not, then feel free to let me know! I think there are some learning materials that talk about the basics of info theory w.r.t. ML, but the rabbit hole goes....very deep indeedy.


Thanks for sharing this. How do you think the local LLM movement will evolve? Especially as in the post, you mentioned startups and VCs both hoarding GPUs to attract talent.

There seems to be a good demand behind tools like llama.cpp or ollama (https://github.com/jmorganca/ollama) to run models locally.

Maybe as the local runners become more efficient, we'll start seeing more training runs for smaller models or fine-tuning done locally? I too am still trying to wrap my head around this.


apart from the obvious GGML, we've done podcasts with both MLC/TQ Chen (https://www.latent.space/p/llms-everywhere) and Tiny/George Hotz (https://www.latent.space/p/geohot) who are building out more tooling for the Local LLM space!

there's actually already a ton of interest, and arguably if you go by the huggingface model hub it's actually a very well developed ecosystem.. just that a lot of the usecases tend to be NSFW oriented. still, i'm looking to do more podcasts in this space, please let me know if any good guests come to mind.


Training smaller models can be really compute intensive (a 7B model should get trained on 1.4T tokens to follow the "LLaMA laws"). So that would be C = 6 * 1.4T * 7B = 58.8T FLOP-seconds. That's 1/5th the compute of GPT3 for example, but it's still a lot. We asked Quentin to do a similar post but for fine-tuning math; that's still a very underexplored space.

(Not to self-plug too much, but this is exactly what last episode with Tianqi Chen was about, if you're interested :) https://www.latent.space/p/llms-everywhere#details


I want to jump in and correct your usage of "LLaMA Laws" (even though you are using it informally, I just want to clarify).

There is no "LLaMA scaling law". There are a set of LLaMA training configurations.

Scaling laws describe the relationship between training compute, data, and expected loss (performance). Kaplan et al. estimated one set of laws, and the Chinchilla folks refined that estimate (mainly improving it by adjusting the learning rate schedule).

The LLaMA papers do not posit any new law nor contradict any prior one. They chose a specific training configuration that still abides by the scaling laws, but with a different goal in mind.

(Put another way: a scaling law doesn't tell you what configuration to train on. It tells you what to expect given a configuration, but you're free to decide on whatever configuration you want.)


Isn't the Chinchilla estimate considered to be wrong now?

https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...


Yep, +1. That's why I used the quotes. :) Thanks for expanding!


Yep I understood that you were using it informally, just trying to keep things informative for other folks reading too.


there frankly needs to be a paper calling this out tho, because at this point there are a bunch of industry models following “llama laws” and nobody’s really done the research, it's all monkey see monkey do


But what would they be calling out?

If industry groups want to run a training run based on the configurations of a well-performing model, I don't see anything wrong with that. Now, if they were to claim that what they are doing is somehow "optimal", then there would be something to criticize.


poor choice of words, i probably mean sketching out the curves/doing ablation studies in a comprehensive way like the chinchilla paper did.


Makes sense! But expensive...


> C = 6 * 1.4T * 7B = 58.8T FLOP-seconds.

Sounds off by a few orders of magnitude since a RTX3090 can smash through about 78TFLOP/s.

I trained nanoGPT (a much tinier model) on a rented RTX3090, and it took about 12 hours for 100k steps (I need to check the params again, but maybe 10M tokens tops).

And the reason is it is not a single flop per parameter per token. You have a deep architecture, compute intensive attention heads and back propagations to do as well.


that said there's more data efficiency to be gained in the smol models space - Phi-1 achieved 51% on HumanEval with only a 5.3x token-param ratio: https://www.latent.space/p/cogrev-tinystories#details


The current generation of models (Llama/Llama2) seem to pass a threshold of "good enough" for the majority of use cases at 60b+ parameters. Quantized 30b models that can run in 24gb of GPU VRAM are good enough for many applications but definitely show their limitations frequently.

It is likely that we will eventually see good fine tunes for Llama 30b that produce usable output for code/other challenging domains, but until we get GPUs with 48g+ VRAM we're going to have to make do with general models that aren't great at anything, and fine tunes that only do one very narrow thing well.
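
The back-of-envelope behind those thresholds (weights only; KV cache and runtime overhead eat into the headroom, so treat these as lower bounds on what you need):

    # Inference memory for the weights alone, at a given quantization width.
    def weights_gb(n_params, bits):
        return n_params * bits / 8 / 1e9

    print(weights_gb(30e9, 4))    # ~15 GB: a 4-bit 30B model fits in 24 GB, barely
    print(weights_gb(70e9, 4))    # ~35 GB: hence the wish for 48 GB+ cards
    print(weights_gb(13e9, 16))   # ~26 GB: even 13B is too big unquantized in fp16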


Ambiguous title, it should be "memory requirements for training LLMs"


Thanks. This is why I read HN comments before the article :)


this was the deepest dive we ever did into Transformers Math 101 (https://news.ycombinator.com/item?id=35631546). we mostly cover AI Engineer/inference time things on the pod, but training knowledge is hard won and extremely valuable, and I love it when experts are willing to distill the rules of thumb for what they have learned.

questions welcome!


Sorry for the criticism here, but your linked paper / article does nothing to explain the math behind transformers. You should re-name it something like 'Scaling Transformer Mathematics' or 'The Math Behind Scaling GPUs / Transformers' or whatever more descriptive name there is to describe what you're outlining in your article. 'Transformer Math 101' should at least explain 1) input embeddings 2) (keys, queries, values) and the idea behind the matrix multiplication of these linear / numeric sets of values 3) the Softmax computation in the original paper as well as the relevant matrix transformations which take place 4) dot products ( and how they relate to similarity scoring) 5) feed-forward neural networks and the gradient descent / backpropagation and how they work. There are many many concepts you didn't even touch upon. This is not 'Transformer Math 101' by any means.


fwiw i'm not the author of that doc, we just interviewed him, and the hn submission i linked to was also renamed presumably for similar concerns. we do have an "Algorithms 101" episode (in the theme of our Datasets 101 and Benchmarks 101 episode) where we have at least some of your topics lined up


related online discussion when this article came out:

from Eleuther: https://twitter.com/AiEleuther/status/1648782486736969728?s=...

from AI anon: https://twitter.com/nearcyan/status/1662937711156625408?s=20

Stella Biderman (coauthor and now lead of the Eleuther Foundation, Stella if you're reading this pls come on our pod sometime!!!!) https://twitter.com/BlancheMinerva/status/164878587828883456...


I'm quite busy right now, but maybe in late September or October?


Yes! I have a 5 pages doc of prep notes I made, happy to share :)


Please do! Thanks


Why is the constant in the formula 6? In the transcript I see a mention of "6 tokens per parameter", but I'm not clear on why that is.


It's 2 for forward pass, 4 for backward pass (2PD + 4PD = 6PD), but we didn't cover why it's 2 and 4 since it was a very deep rabbit hole. Maybe he'll do a separate post for that in the future!


If you want a speedrun explanation for how we get to "2": In the limit of model scaling, context size doesn't matter (yes, forget about the quadratic attention); most of the compute is in the linear layers, which boil down to matrix multiplies. Consider a single matrix of size [T,d] multiplied by a weight of size [d,d]: the compute needed for that matrix multiplication is approximately 2Td^2 (the 2 coming from multiply + add). Swap T out for D, your whole dataset in tokens; d^2 is the number of parameters in a single linear layer, so scale up your model to P, and you've got 2PD.

Even shorter: The 2 comes from the multiply-add
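
Putting the pieces together for the 7B / 1.4T-token example upthread (the sustained-throughput number below is my own rough assumption, not a benchmark):

    N = 7e9                 # parameters
    D = 1.4e12              # training tokens
    C = 6 * N * D           # total training compute, in FLOPs
    print(f"{C:.2e} FLOPs")             # ~5.9e22

    sustained = 1.5e14                  # assume ~150 TFLOP/s sustained per GPU
    gpu_days = C / sustained / 86400
    print(f"~{gpu_days:.0f} GPU-days")  # order of magnitude only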


There's a breakdown here for anyone interested (ctrl+f "weight flops for")

https://medium.com/@dzmitrybahdanau/the-flops-calculus-of-la...


if you click through to the source doc and we talk about it on the pod, it's basically 2 for forward pass, 4 for backward pass


These headlines read like owning a Tamagotchi virtual pet. Will it become popular to have one's own LLM, name it, and battle others!?


How feasible is it to train a toy model / very niche model on high-end consumer hardware, i.e. a 24GB GPU?

From scratch not tuning


It depends on the number of parameters you're going for - a billion parameters is going to be out of reach from scratch, but you could train some in the millions.
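
A rough way to see where that line falls (using the usual ~18 bytes/param rule of thumb for mixed-precision Adam; activations and framework overhead are ignored, so real limits are tighter):

    # Memory for weights + gradients + Adam optimizer states when training from scratch.
    def training_state_gb(n_params, bytes_per_param=18):
        return n_params * bytes_per_param / 1e9

    for n in [10e6, 125e6, 350e6, 1e9]:
        print(f"{n/1e6:>5.0f}M params -> ~{training_state_gb(n):.1f} GB of training state")

So somewhere in the hundreds of millions of parameters is where a 24GB card starts running out of room, before you even count activations.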



