DeepMind’s New Language Model, Chinchilla (marktechpost.com)
229 points by georgehill on April 11, 2022 | 142 comments



Off-topic to Chinchilla, but relevant to the source site: MarkTechPost consistently borderline-plagiarizes articles and shares them on their website as "paper summaries". They copy-paste from the source material and change some of the wording around so as to appear original. My work, as well as other work from Berkeley AI Research, has been posted in this manner on their site.

This seems highly unethical, and I'm surprised how they continue to operate.


To add to this - they do this regularly, multiple times per week. While they do link to and acknowledge the source work, they do not make clear that their writing is quoted or nearly quoted.


Fill out a DMCA notice:

https://abuse.cloudflare.com/

Cloudflare will forward it to their host, I believe, who will then ask them to remove the infringing material or provide a counterclaim.


IANAL but I'm pretty sure you can only do that if you own the copyright that is being infringed upon.


I don’t know about this site, and I agree it's unethical. But it does make me realize that I much prefer using the language of the paper directly, as opposed to having a non-expert poorly translate what your paper said. Especially given how much time papers put into the accuracy and specificity of their language and word choices.

Would it also annoy you if they screwed up the interpretation of what you wrote? Is the alternative less reach for your work? For hard-core research the tradeoffs seem tougher. If it's just lazy copying, that's strictly messed up.


It's okay to directly quote, which is what quote marks are for, with proper attribution, of course.


Thanks for the heads up! In that case, I'd prefer not to share this link with peers. Do you have an alternative source with similar high-level content to share?


Tough to say. Technically https://arxiv.org/pdf/2203.15556.pdf has the same content, it just isn’t highlighted the same way.


OP here - Thanks for sharing. I wasn't aware of this but despite this behavior, they are getting 600k visits.

https://www.similarweb.com/website/marktechpost.com/#overvie...


We had better get used to it, because news companies will say an AI wrote it. No law allows suing an AI for plagiarism - go prove something is not an AI.


No one sues the car, the dog, or the child; they sue the owner, the responsible party, the parent, etc.


Seems the link is down. Found a decent synopsis/discussion on lesswrong.

https://www.lesswrong.com/posts/midXmMb2Xg37F2Kgn/new-scalin...

> On March 29th, DeepMind published a paper, "Training Compute-Optimal Large Language Models", that shows that essentially everyone -- OpenAI, DeepMind, Microsoft, etc. -- has been training large language models with a deeply suboptimal use of compute.

> Following the new scaling laws that they propose for the optimal use of compute, DeepMind trains a new, 70-billion parameter model that outperforms much larger language models, including the 175-billion parameter GPT-3 and DeepMind's own 280-billion parameter "Gopher".


I think an immense amount of such suboptimality is still hanging from the tree, so to speak.

For example, our recent paper "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer"[1] shows that even the learning rate and initialization used by existing models are deeply wrong. By just picking them correctly (which involves some really beautiful mathematics), we can effectively double the model size of the GPT-3 6.7B model (making it comparable in quality to the 13B model across the suite of benchmark tasks).

Large neural networks behave in ways we are only beginning to understand, in part because each empirical probe of such a model is so much more expensive and time-consuming than for typical models. But principled theory can have a lot of leverage here by pointing out the right direction to look, as it did in our work.

[1] http://arxiv.org/abs/2203.03466


What do you think about the concept of "critical batch size"? https://openai.com/blog/science-of-ai/


I think the concept makes sense. The basic insight, that the right batch size depends on the difficulty and noisiness of a task, is already used by teams. For example, the PaLM paper from last week increased its batch size throughout training.

But as far as I know, the more precise predictions of optimal batch size aren't used much, probably because it's expensive to measure accurately, or because the predictive equation isn't accurate enough to begin with. I wonder if we can "transfer" the optimal batch size from a smaller setting (smaller model or data) to the full setting, like in our paper. This would make it much more practical.


According to the LessWrong post, the smaller model trained on more data performs better on most of the tasks, but it’s worse on “college level math” questions. I wonder why that is. Is it because the extra capacity of the larger model was used to basically memorize theorems? Or is it because the extra “brain power” let it model the math better? Oddly, one of the tasks where the smaller model most outperformed the larger one is “high school level math”! Very counterintuitive, and I am curious if there are any big takeaways lurking in that disparity.


Gwern responded to a similar question in the comments section.

(parent)

> the fact that data and compute need to scale proportionally seems… like a big point in favor of NNs as memorizers/interpolators.

(gwern)

> Surely it's the opposite? The more bang you get out of each parameter, the less it looks like 'just' (whatever that means) memorization/interpolation. When you needed to increase parameters a lot, disproportionately, to cope with some more data, that does not speak well of abstractions or understanding. (If I can train a 1t model to get the same loss as what I thought was going to take a 100t model, why would I think that that 100t model must be memorizing/interpolating less?) Let's take your claim to its logical extreme: suppose we discovered tomorrow a scaling law that made parameters near-constant (log, let's say); would that not suggest that those parameters are super useful and it's doing an amazing job of learning the underlying algorithm and is not memorizing/interpolating?


This isn’t addressing their question. And Gwern’s goal here is to (incorrectly) try to get rid of the idea that models are just memorizing and interpolating, when in fact memorization and interpolation is what we all do, including models. He’s just bothered by the idea that people think of models as less than magic.

On the other hand, https://twitter.com/model_mechanic/status/151297688118364569... is admittedly pretty magical, even if the basis of that magic is memorization and interpolation.


To rebut someone's argument you must address the argument and not just talk about them and their motivations

From your comment a reader will understand that you think they're just memorizing and interpolating and that you disagree with gwern on this point, but you've given your reader nothing that argues in favor of your position

Why should someone believe that models are just memorizing and interpolating?


It's impossible for a piecewise linear function to be anything other than linear outside the training sample. They are by their definition unable to do anything but interpolate.
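
For intuition, here is a minimal sketch of that point (assuming NumPy and a toy, randomly initialized ReLU net; it has nothing to do with any real language model). Far from the data, the set of active ReLUs stops changing, so the network is affine along a ray:

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy one-hidden-layer ReLU network with random weights: f(x) = W2 @ relu(W1 @ x + b1) + b2
    W1, b1 = rng.normal(size=(32, 1)), rng.normal(size=32)
    W2, b2 = rng.normal(size=(1, 32)), rng.normal(size=1)
    f = lambda x: (W2 @ np.maximum(W1 @ np.array([x]) + b1, 0.0) + b2)[0]

    # Far from the origin the active-ReLU pattern is fixed, so the function is affine
    # along the ray: equal input steps give (numerically) equal output steps.
    xs = [1000.0, 1001.0, 1002.0, 1003.0]
    print(np.diff([f(x) for x in xs]))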


(Side note: Transformers aren’t piecewise linear. The dot products are bilinear, and feeding the same input twice (under different linear maps) into a bilinear map produces a quadratic map, not a linear one.)

People arguing about this are basically all speaking ambiguously, in ways that tend to either make an apparent disagreement when there is none, or hide the location of the actual disagreement.

It is true that a piecewise-linear function, within any linear component, will have any convex combination of some points be sent to the corresponding convex combination of where it sends those points.

It is not true that a piecewise-linear model trained on a set of data points will produce only outputs which are a convex combination of outputs that appear in the training set.

These are both obvious.

If one person takes the former to be what “it just interpolates between points” means, and another takes the latter to be what “it doesn’t just interpolate between points” means, and then they argue about which of them is right, then both are being silly.

I’m not saying that this is literally what is happening. This is meant as a metaphor for somewhat more sophisticated/reasonable interpretations of “(doesn’t) just interpolate(s) between points in the data set”.

_____________

A model trained on images which produced only convex combinations of images in its training set, would clearly be producing what could be called “interpolations between images in its training set”, and taking convex combinations of images is unimpressive.

This is obviously not what today’s image ML models do.

And, of course, you aren’t claiming that they do.

______

I should speak plainly.

Much of where the disagreement is, or is hidden behind, is disagreement as to the meaning of “just interpolation”.

At one end, “just interpolation” could refer to “take the Voronoi cells of the inputs in the training set (or maybe the dual of it, whatever), and at runtime, find the nearest neighbors of the point and take the linear combination of their assigned outputs, weighted according to the distances to the point.” This would certainly be “interpolation”, and is not impressive, calling it “just interpolation” seems quite fitting. However, it is obviously not what ML models do.

On the other end of the scale, “interpolation” could be interpreted as meaning “any process whatsoever, except that the process is required to be based primarily on the training data, with the process being generated mechanically from the training data, of computing an output for a given input.” And, certainly today’s ML models satisfy this description, but with this description the moniker “just” seems inappropriate. It is like saying “just a process”. Well, yeah, everything is a process.

__________

It seems to me like much of what the disagreement ought to be about (which might not be what it is about) is along the lines of: how many conceptual layers of something are captured? Like, say something modeled images of faces as “linear combinations of images from this list of images of faces”. That’s an extremely basic thing. Very slightly more would be something that determines the positions of facial features, does stretching etc. of images to make them line up with the image to reproduce, and then does linear combinations. Then, suppose something takes the parts of the images of the face which are just skin (not, say, lip skin or eyes), takes local averages of this in a number of general locations of the face (relative to the locations of facial features), and takes principal components of this across the training set, with principal components perhaps corresponding to one or two for skin tone, a direction for the lighting in the image, and maybe a component for how shiny the skin is.

A model which represents a face in terms of variables which we can interpret as things like “position of eyes”, “skin tone”, “lighting”, seems notably less in the direction which one might call “interpolating” than one which just lists a coefficient for each image in the training set (or each principal component of images (taken as plain vectors) in the dataset). And, of course, one can go farther than this in this direction. And the further one goes in this direction, (so, like, the more that what the individual images in the training set tell the model is “here is more data about an overarching pattern”), the less it seems like what one might be inclined to call “just interpolation”.


>It is not true that a piecewise-linear model trained on a set of data points will produce only outputs which are a convex combination of outputs that appear in the training set.

No, and I didn't claim that. I said that, outside the training sample, the model is linear (or quadratic in the case of transformers, thanks for pointing that out). Whether linear or quadratic, a model that has a fixed structure outside the training sample will obviously not fit data which lies far away from the training sample - i.e. it will not extrapolate. This isn't controversial - it's just something people like to forget about.

>A model trained on images which produced only convex combinations of images in its training set, would clearly be producing what could be called “interpolations between images in its training set”, and taking convex combinations of images is unimpressive.

True! I should have clarified that it's not linear interpolation in pixel space (or input space generally), but interpolation on the latent manifold. This is where the power, as well as the limitations, of deep learning comes from. It's definitely non-trivial to identify the latent manifold of the data - different dimensions of the manifold may sometimes even correspond to independent components, as you mention (position of eyes, skin tone, ...), though empirically, finding disentangled latent codes is mostly a function of the random seed.

How does an NN process a new input? It maps the input to the latent manifold.

In the input space, it will be some highly non-linear, non-trivial combination of points, which in terms of Euclidean distance in the input space, could be arbitrarily close or far away.

In the latent space, the output will be some convex combination of nearby points.

Here's the kicker - even if your problem happens to be well-modeled as a continuous, low-dimensional manifold embedded in a high-dimensional space (and many, many problems aren't), and even if you manage to obtain a super dense sampling of input space, so that the manifold can be well-approximated (which is impractical or impossible for most problems),

you will never be able to generalize beyond the data distribution.

Our brains don't stop working as soon as conditions are slightly different from what we've seen before. If there's a slight fog on a Stop sign, we can still see a stop sign. If the Go board is 9x9 rather than 19x19, we can still play Go. If we can play Starcraft on one map, we're pretty much as good on a different map, we don't need to relearn the game over the next several thousand years.

How come? Because we aren't just latent space interpolators. We can extrapolate.
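
A quick way to see the failure to extrapolate, as a sketch (assuming scikit-learn; the network, data, and sizes here are toys chosen only for illustration): fit a small ReLU net on sin(x) over [-pi, pi], then ask it about a point far outside that range.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-np.pi, np.pi, size=(2000, 1))   # training inputs
    y = np.sin(X).ravel()                            # target function

    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
    net.fit(X, y)

    print(net.predict([[0.5]]), np.sin(0.5))    # inside the training range: close
    print(net.predict([[10.0]]), np.sin(10.0))  # far outside: just the affine extension,
                                                # nowhere near sin(10)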


Why do you say they just memorize and interpret? I can teach GPT-2 new things, including new objects and their physical properties and it does a good job with that. That also means it has definitely not just regurgitated a matching sentence back to me.


when i see a new object for the first time, i MEMORIZE what i INTERPRET as its identifying traits, and ask someone who has already MEMORIZED what that object is to INTERPRET a concept with which i can associate those traits. the next time i encounter an object with those traits i can then recall the associations, then compose those trait-level interpretations into an interpretation of an object.

at a fundamental level that's all this is, compositions of associated memorizations and interpretations, which map to compositions of sentence parts the machine can regurgitate back to you


That's a bit of a reductionist way of looking at any sort of learning; I'm not sure how it's helpful to use memorize/interpret in terms of distinguishing what language models do compared to other types of learning.

I might be missing the point though?


Probably right. Most people dump on these language models for this reason, but it would be absurd for a high-school student to have to re-derive the quadratic formula every time they worked on an algebra problem, so naturally you memorize it. Why should it be any different for a language model?


I never memorized the quadratic formula, and I did OK.


Did you go to school in the US in the last 2-3 decades?


No, the UK. But completing the square only takes a minute.


Try completing the square when the coefficients are variables :)


Once you start calculus they let you use a real calculator


That may be true, but in the US there are typically math courses before calculus.


But then we get a calculator.


Even then, it is typically not a symbolic calculator so if your answer is a closed form function of variables, you're SOL with a TI-84.


Maybe we went to radically different schools but I certainly had to calculate by hand using the quadratic formula countless times where calculators were not allowed to be used.

Anyway it distracts from the point so it's not relevant.


It might just be by chance: the initial weights of one model could have been lucky in some areas, and unlucky in others. There's no way to tell other than training again, which is a costly proposition.


That seems pretty unlikely to me actually. As the models and training data get much bigger, I think the initial weights become less important (at least assuming your random weights have certain desirable statistical properties, which they do by construction usually).


70 billion parameters... Is each of those a 4-byte float?

So, is that 280 billion bytes of just parameters?


I'm fairly confident each of those is a 2-byte float, but yes that's over 100 GB of parameters.


Welcome to the party! I joined ML because I realized I could help. You can too. I bet you’re both already thinking of clever ways to deal with massive models from an infrastructure standpoint. That’s just one of hundreds of interesting problems.


Is 100GB of parameters really that large? 128GB of RAM on a server class machine is not unusual. Seems such a model could fit entirely in RAM.


To elaborate on the sibling comment: main memory is much bigger, but CPUs are much, much slower. It would be a challenge to merely run a model like this on CPU, and totally infeasible to train one. So the challenge is to fit into the memory of a single GPU you can afford, coordinate multiple GPUs, or efficiently page from main memory into GPU.


GPU memory is generally much smaller and more expensive


Is there any source which explains what billions of parameters actually are?

In my mind a parameter is: language, dialect, perhaps context parameters (food, dinner, lunch, travel), and if we then talk about language and audio, perhaps sound waves, gender.

Or are they context parameters which give you insight? Like, are a billion parameters literally something like travel=false, travel-europe=true, people speaking=e, age, height?


Parameters are just floating point numbers, at most they can be seen as degrees of freedom or kind of like the order of a polynomial used in curve fitting.

They're too abstract to assign much meaning to individual parameters, as our understanding of why their values are exactly the way they are is extremely limited.


A good visual introduction to neural networks can be found here: https://playground.tensorflow.org

A parameter is a "weight" in this case (the lines drawn from neuron to neuron). The neurons are effectively runtime values or "activations." Parameters (weights) are updated during training and then set as constant during "inference" (also called "prediction").

There's unfortunately a ton of jargon and different groups use different words almost exclusively.


A parameter is a scalar value, most of which are in the attention matrices and feedforward matrices; you also hear these called “weights”. Any intro to DL course will cover these in detail. I recommend starting with Andrew Ng’s Coursera class on Intro to Machine Learning, although there may be better ones out there now.
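
If it helps to make that concrete, here is a small sketch (assuming PyTorch; this is one generic encoder layer with illustrative sizes, not Chinchilla's actual architecture) showing where the parameters of a single transformer block live:

    import torch.nn as nn

    # One generic transformer encoder block; the sizes are illustrative.
    block = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

    total = sum(p.numel() for p in block.parameters())
    print(f"{total:,} parameters (scalar weights) in this one block")  # a few million
    for name, p in block.named_parameters():
        print(name, tuple(p.shape))  # attention and feedforward matrices plus biases

A 70B-parameter model is this sort of thing scaled up: wider matrices and many more blocks.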


Input parameters vs. weights, then?

I see, thanks.


These networks (text models) usually have around a few thousand inputs.


The parameters are the weights of the neural network, in this case.


It's rare a single parameter maps to a human understandable concept. Occasionally someone finds one that does map fairly well, for example this case back in 2017: https://openai.com/blog/unsupervised-sentiment-neuron/#senti...


Yes, parameters are usually stored as float32, activations as bfloat16 or float16.


This is exciting, if only because as we discover more compute-optimal models that outperform the behemoths that have been state of the art, it opens up the ability for smaller independent groups to train and release their own versions, more fully democratizing AI. Looking forward to a group like Eleuther or Hugging Face releasing a version of this.


>This is exciting, if only because as we discover more compute-optimal models that outperform the behemoths that have been state of the art, it opens up the ability for smaller independent groups to train and release their own versions, more fully democratizing AI.

I think I support this in principle but it seems like the scaling curves keep going so it's easier to just make larger models with more data.

>Looking forward to a group like Eleuther or Hugging Face releasing a version of this

Both of those groups have access to dozens if not hundreds of cloud GPUs; I'd hardly call them small.

It would be impossible to replicate these models as, say, an independent researcher, or even in an academic research group outside of maybe Stanford/Berkeley/MIT/etc., and I'd even doubt their ability to replicate models like this, based purely on cost alone.


Small is relative -- and to Google, Facebook and Microsoft they're positively tiny. Perfect is the enemy of good or some such and I think this is a move in the right direction even if I can't personally train this on my 3090.


They trained over 400 language models, ranging from 70 million to over 16 billion parameters, on 5 to 500 billion tokens, while staying under a given compute budget. The results are modelled, and they pick the best configuration. It turns out that, for a fixed budget, a somewhat smaller model trained on more tokens improves performance.
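
As a rough sketch of the "model the results and pick the best one" step (assuming NumPy; the constants below only mimic the general shape of the paper's parametric loss L(N, D) ~ E + A/N^alpha + B/D^beta and use the common C ~ 6*N*D FLOP approximation, so treat all numbers as illustrative):

    import numpy as np

    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28   # illustrative constants

    def loss(N, D):
        # Parametric loss surface fitted to many small training runs.
        return E + A / N**alpha + B / D**beta

    C = 5.9e23                      # a fixed training FLOP budget (made up)
    N = np.logspace(9, 13, 2000)    # candidate model sizes (parameters)
    D = C / (6 * N)                 # tokens each size can afford under C ~ 6*N*D
    best = np.argmin(loss(N, D))
    print(f"compute-optimal under this budget: ~{N[best]:.1e} params, ~{D[best]:.1e} tokens")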


Thank you :)


Is outperforming GPT-3 still a good reference? It seems there are many models outperforming GPT-3 in the SuperGLUE benchmark: https://super.gluebenchmark.com/leaderboard/ GPT-3 is in position #21, with a 71.8% score. The best model is at 91.2%. Note the human baseline at #6 with 89.8%.


Note that this isn't an apples-to-apples comparison. The GPT-3 position is for a few-shot use case that has not been trained for this particular task. When fine-tuned, GPT-3 would be expected to perform a lot better. Lastly, GPT-3 currently operates on the text-002 models, and the third version of GPT-3 is generally the one considered current; these benchmarks are for the original GPT-3 model.


It's a good reference because people are familiar with GPT-3. The paper mostly compares Chinchilla to LaMDA, Jurassic, Gopher, MT-NLG, and GPT-3. In the broader tech industry and even to a certain extent within the AI field, GPT-3 is the only one that most people know by name.


Aren’t most of the models at the top not suitable for text generation? That’s what makes GPT different from BERT.


What are the models at the top used for? Excuse my ignorance.


Mostly mask fill, but Transformers can be fine tuned to downstream tasks relatively easily (T5 was built for translation but is used for autocomplete in many cases)


would you mind sharing some references (or even just googleable terms) for this process of fine tuning?


> Is outperforming GPT-3 still a good reference?

It is if you outperform it with far fewer parameters.


The design of the original Transformer model in the "Attention Is All You Need" paper was predicated on efficiency (all layers the same size, and word/token embeddings combined with harmonic embeddings of position in the input stream). It is good to see improvements!


Cached version since the original is down (I'm assuming it's down due to load issues and not due to the author taking it down). https://webcache.googleusercontent.com/search?q=cache:PLSLy9...


Is there a good reference as to what a "parameter" is in this context? I've looked a few times, but the explanations don't make any sense to me.


It's a degree of freedom of the learnable model. For example, a "vanilla" neural network layer (MLP) which maps from M to N feature dimensions will contain an MxN matrix of learnable parameters that model the connections between the M inputs and the N outputs. Every time the model is updated during backpropagation, the loss gradient which has to be computed has the same dimensionality as the number of parameters. Also, more parameters generally means more operations in the forward pass. Therefore, a model with more parameters will in general require more FLOPs per iteration of training. The main point of this paper is that you can actually do better by training a smaller model for longer, rather than a bigger model for less time, assuming you have a fixed FLOP budget.
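
A tiny concrete version of that layer, with made-up sizes, to tie parameter count to per-example compute:

    # A "vanilla" layer mapping M input features to N output features.
    M, N = 512, 1024
    weight_params = M * N       # the MxN connection matrix
    bias_params = N             # one bias per output
    params = weight_params + bias_params

    # The forward pass does one multiply and one add per weight, per example,
    # so FLOPs per example scale directly with the parameter count.
    flops_per_example = 2 * M * N
    print(params, flops_per_example)   # 525,312 parameters; ~1.05M FLOPs per example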


The other thing with more parameters is that it gives the NN more ability to overfit. That means that instead of, say, learning what a dog is, it instead memorises all the sentences containing "dog" that it has ever seen.


You can think of a parameter as a number you can tweak while training. This network has 70B such numbers.


And if every parameter is one byte, the minimum, it will take at least 70 GB to save or share this model. So it's still way too big to package directly in an app.


From the paper, they are using bfloat16, so I guess two bytes. But distributing and "packaging into an app" are not at all of practical interest for these kinds of models. You (a consumer) would interact via some API service, with the model running on a hardware-accelerated compute cloud.

In any case, during training (where the model is run in possibly large batches), and even during inference, the size of the parameters is completely dwarfed by the intermediate tensor representations.


> even during inference, the size of the parameters is completely dwarfed by the intermediate tensor representations

What makes you say this?


It's especially true for models that do some kind of weight sharing, which is very common (CNNs, RNNs, transformers, etc). For a concrete example, consider a layer from an image convolutional network, which maps from a 3-dim colorspace to a 128-dim feature space. Assuming a 5x5 kernel that's about 10k parameters. However, after applying this layer, you go from having an (B,H,W,3) tensor to a (B,H-4,W-4,128) tensor, where H,W are the height and width of the image, and B is the number of images in the batch. If you're working with even moderately high resolution images, the memory required for these intermediate tensors at each layer is much larger than the parameters.

Something similar applies for RNNs (same weights applied at each element of a sequence), GNNs and transformers (same weights applied at each pair of data).
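
Putting rough numbers on that conv example (the batch and image sizes below are assumptions, chosen only to show the ratio):

    # 3 -> 128 channel convolution with a 5x5 kernel, as described above.
    B, H, W = 32, 512, 512                  # assumed batch of moderately high-res images
    params = 5 * 5 * 3 * 128 + 128          # kernel weights + biases, ~9.7k scalars
    acts_in = B * H * W * 3                 # input tensor (B, H, W, 3)
    acts_out = B * (H - 4) * (W - 4) * 128  # output tensor (B, H-4, W-4, 128)
    print(f"parameters:  {params:,}")
    print(f"activations: {acts_in + acts_out:,}")   # ~10^9 scalars, dwarfing the parameters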


Have you seen modern games?


I doubt they load that amount of data in memory


I'm thinking about upgrading from 64gb to 128gb so i can use all my Cities: Skylines assets in the same map


Right, they usually stream assets as they are requested. Large models do the same.


If these things get put on specialized hardware for inference with much lower energy costs, the world will never be the same.


The biggest problem, first of all, might be the memory requirements, given so many parameters. It couldn't be as cheap as a high-end computer in the foreseeable future.


There is probably a space-time tradeoff that needs to be explored here. It might be possible to preload some of the most likely next tokens into the cache and/or RAM. These are glorified auto-complete algorithms that are poorly understood, as DeepMind's optimizations appear to show. For the English language, it is probable that there are only so many possible grammatically correct selections for the next token, for example.


Glorified autocomplete? Autocomplete can guess the next word... sometimes. GPT-3 goes hundreds of words ahead. On generic topics it can be hard to distinguish from human text.

And it can't cache tokens because all tokens are evaluated in the context of all the other tokens, so they don't have the same representations when they reoccur at different positions.


They're evaluated in the context of the last 2^n tokens; for many models that's a scanning window of 1024, 2048, or 4096 tokens. The tokens (words and sometimes punctuation) are represented by integer values, so the last 2^n tokens would certainly qualify for storage in a cache. The next-token selection also only has so many possible choices in any given language model because of grammatical limitations. This is only one such optimization; there could also be optimizations around the likelihood of certain words being used given the presence of certain previous tokens, and so on.

But, yes, tokens are chosen one at a time based on the previous content, similar to earlier auto-completion algorithms.


I’ve been saying this for years: language models are the ML equivalent of the billionaire space race, just a bunch of orgs with unlimited funding spending millions of dollars on compute to get more parameters than their rivals. It could be decades before we start to see them scale down or make meaningful optimizations. This paper is a good start, but I’d be willing to bet everyone will ignore it and continue breaking the bank.

Can you say that about any other task in ML? When Inception v3 came out I was able to run the model pretty comfortably on a 1060. Even pix2pix and most GANs fit comfortably in commercial compute, and the top-of-the-line massive models can still run inference on a 3090. It’s so unbelievably ironic that one of the major points Transformers aimed to solve when introduced was the compute inefficiency of recurrent networks, and it’s devolved into “how many TPUs can daddy afford” instead.


Is that fair? My Pixel phone seems to run nothing but ML models of various kinds and they run locally which is madness, pure madness. It can recognize songs and my speech without talking to the cloud at all. That's pretty much the definition of optimization!


It's just about where the software development incentives are. Big shops have incentive to have service models. I think of it like a return to the mainframe days, and an old-IBM like mindset.

However the upside to pocket sized intelligence will eventually win out. It's just a question of when someone will scrape together the required investment.


Imagine any diffusion-style text-to-image model on specialized ASIC hardware.


That’s what an ANE/TPU is.

If you mean putting the model weights into gates directly, it’d be useless because users would get bored of the model as soon as they figured out what its style looked like. Also, large models can memorize their training data so eventually you’ll get it to output something copyrighted.


These models are definitely entering the space where no one could ever get bored of them, and many styles can be generated.


Does this imply we will run out of data to keep up with larger model sizes?

Is there much more data out there than what they’re already using?


It implies our models are wrong.

Consider that a human adolescence is ~9.46x10^6 minutes and a fast speaking rate is ~200 words/minute. That sets an upper bound of 1.9 billion words heard during adolescence, i.e. human adults are trained on a corpus of less than 1.9B words.

To some extent, more data can offset worse models, but I don't think that's the regime we're currently in. GPT-3 was trained on (among other languages) 181 billion English words - or about 100 times more words than a human will hear by the time they reach adulthood. How is the human brain able to achieve a higher level of success with 1% of the data?

1. https://github.com/openai/gpt-3/blob/master/dataset_statisti...
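
For reference, the arithmetic behind those numbers (18 years of minutes at a fast speaking rate, versus the English word count from [1]):

    minutes_of_adolescence = 18 * 365.25 * 24 * 60             # ~9.46e6 minutes
    words_per_minute = 200                                     # fast speech
    human_budget = minutes_of_adolescence * words_per_minute   # ~1.9e9 words
    gpt3_english_words = 181e9                                 # from the dataset statistics [1]
    print(human_budget, human_budget / gpt3_english_words)     # ~1.89e9 words, ~1% of GPT-3's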


> How is the human brain able to achieve a higher level of success with 1% of the data?

The most obvious answer is "the human brain uses a shit-ton more compute", for 18+ years as well.

We spend data, which we have in abundance, to save on compute, which we do not. Even at the most generous low-end estimates of the human brain's computing power, we are only barely there; on the high-end estimates that people in love with the ineffable mysteries of the brain love to cite, we are multiple orders of magnitude away from even the biggest supercomputers matching the brain. So no matter which way you slice it, we are extremely compute-poor.

Feeding a lot of data through an extremely lightweight optimizer like first-order SGD is one way to cope with lacking compute: https://www.gwern.net/docs/ai/scaling/2013-bottou.pdf Bottou asks why (even in 2013!) SGD is so hard to dethrone when we can empirically see plenty of optimizers, like second-order gradient descent algorithms, that beat SGD quite solidly. His observation is that while they are much better than SGD in terms of iterations or _n_, they lose in compute/wallclock, because SGD can just go brrrr through the data much faster than they can.


Yeah, there are ~100B neurons, ~1Q synapses, but how much compute is the brain actually using over time?

Some quick googling gives this:

- Generation of an action potential seems to use ~2.5×10^−7 J [0]

- The brain consumes around 20W during normal activity

This seems to imply that there are around 8×10^7, call it 10^8, activations per second [1].

Apparently, the average neuron has 1000 synapses. Let's say each synapse requires 10 mulacc operations per activation. Doing that math gives about 10^12 FLOPs/s [2].

Integrate that over 18 years, and you get roughly 5.7×10^20 FLOPs [3].

PaLM required 2.56×10^24 FLOPs to train [4]. So we have (way more than) enough compute; we're just not using it efficiently. We're wasting a lot of FLOPs on dense matrix multiplication.

There's plenty of wiggle room in these calculations. I checked over the math, but I'd appreciate if someone would let me know if I've missed something.

    [0]: https://link.springer.com/article/10.1007/s11571-018-9503-3
    [1]: https://www.wolframalpha.com/input?i2d=true&i=Divide%5B20+W%2C2.5%E2%80%89%C3%97%E2%80%89Power%5B10%2C%E2%88%927%5D+Joules%5D
    [2]: https://www.wolframalpha.com/input?i2d=true&i=Power%5B10%2C8%5D+Hz+*+1000+*+10+flop
    [3]: https://www.wolframalpha.com/input?i2d=true&i=Power%5B10%2C12%5D+Divide%5BFLOP%2Cs%5D+*+18+years
    [4]: https://blog.heim.xyz/palm-training-cost/#:~:text=PaLM%20(2022)-,2.5e24,-10x***
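
The same back-of-envelope estimate in one runnable block (every input is one of the assumptions above, not a measured value):

    j_per_spike = 2.5e-7                      # energy per action potential [0]
    brain_watts = 20.0                        # brain power during normal activity
    spikes_per_s = brain_watts / j_per_spike  # ~8e7 activations/s
    synapses_per_neuron = 1000
    flops_per_synapse_per_spike = 10          # assumed mul-acc cost
    flops_per_s = spikes_per_s * synapses_per_neuron * flops_per_synapse_per_spike
    seconds_18y = 18 * 365.25 * 24 * 3600
    print(f"{flops_per_s:.1e} FLOP/s, {flops_per_s * seconds_18y:.1e} FLOPs over 18 years")
    # ~8e11 FLOP/s and ~4.5e20 FLOPs, versus ~2.56e24 FLOPs to train PaLM [4]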


There is a long history of connectionist attempts trying to ballpark the brain compute to constrain AI timelines, going back to von Neumann/Turing/Good. The most recent one would be https://www.openphilanthropy.org/brain-computation-report You can see in Figure 1 that your 10^12 steady state is the very low end. If you're interested in seeing where your envelope estimate differs from the others, well, it has the references.


My understanding is that the binding constraint in training these models is the quantity of computation they consume. While a human makes do with drastically less input data, we also have drastically more computational resources in our heads to work on the problem than Google is using to train its models.


Yeah, this implies backpropagation is deeply suboptimal.


That is certainly a possibility. Other (non-mutually exclusive) implications may be that human language acquisition benefits from being part of a multi-task model. Or that the problem has been over-reduced, i.e. human language acquisition cannot simply be distilled into a words-in -> words-out problem, and vision/hearing are actually integral parts of language acquisition that cannot be left out. Or that model architectures still have major improvements to be made, and attention is not all you need, for example.


> and that vision/hearing are actually integral parts of language acquisition

Deaf-blind authors would beg to differ.

But yes, a human brain is exposed to lots of other sensory input, and we know from other research that multi-modal models can learn shared representations that benefit from the knowledge of each domain.

In Transformer's favor, at least, they are far closer to tabula rasa than the human brain is and likely have to dedicate a lot of their training time to things that are otherwise "baked" into human brains. For example, humans come pre-packaged with V1 and V2 as part of their visual system, but CNNs and ViTs have to learn those filter packages from scratch.

I agree with you though. Human brains are able to take single instances of experiences and build a wealth of understanding from them in ways that even modern Transformer architectures cannot yet match.


It seems like internal language (thinking in language) is also a way our brains train themselves. I’ve probably thought 100x more words than I’ve spoken.


This would map to a sort of semi-supervised approach. For a lot of problems this has shown to drastically reduce the data requirements, but can bump up compute.

All those conversations in the shower were actually regularizers!


There's a ton of data that can be exponentially more useful, but we'll need networks that can (analogously) be late to work enough times to get fired, or experience heartbreak in succession while misunderstanding why prior heartbreak happened, or hallucinate stray cats when they're walking around the neighborhood at night


Probably not an issue just yet; think of how much data is generated by Twitter on a daily basis, for example.


If you wanted to teach your kid English, and they came back to you and said "Dad/Mum, I finished reading the entire internet but I still don't understand English fully", would you say "OK son, now go and stare at the Twitter firehose until you grok perfect English"?

It's clear that these models have orders of magnitude too much data already.

It somewhat reminds me of the proposals for larger and larger colliders in the hopes of seeing new physics that is always one collider in the future.


> It somewhat reminds me of the proposals for larger and larger colliders in the hopes of seeing new physics that is always one collider in the future.

I agree with your main point, but think this analogy isn't an apt one. If you want to see what particles are created at higher energies you kinda need the bigger particle accelerators. (This isn't to say that we shouldn't be investigating lower energy collisions, but at a certain point you do need "bigger colliders" to see new things)


The general point is that there is a huge volume of training data generated daily, not that Twitter is a great source of it. Though I believe that GPT-3, for example, was trained on the Common Crawl dataset, which would contain both Twitter and Reddit.

>It's clear that these models have orders of magnitude too much data already.

Seems like a strange claim. The scaling laws are showing that you can still make gains with more data and more parameters.

>It somewhat reminds me of the proposals for larger and larger colliders in the hopes of seeing new physics that is always one collider in the future.

This is literally true, though; we couldn't have found the Higgs without the LHC, and most GUT candidates would only start being ruled out at higher energy levels.


Common Crawl actually does not contain Twitter, you can go check the indexes https://github.com/ikreymer/cdx-index-client . Twitter is extremely aggressive about scraping/caching, and I guess that blocks CC. Models like GPT-3 still know a decent amount of Twitter material, and I figure that this is due to tweets being excerpts or mirrored manually in non-Twitter.com URLs (eg all the Twitter-mirroring bots on Reddit).


> Seems like a strange claim. The scaling laws are showing that you can still make gains with more data and more parameters.

But then we’ve given up on matching human intelligence which is all about working efficiently with small training data, and certainly training a human does not need anywhere near as much data as GPT-3.

GPT-3 was interesting as a proof-of-concept of what happens when you use a gigantic amount of training data. We don’t need a bigger one until we can figure out how to make a smaller one that is just as effective.

If scaling laws are telling us to keep putting even more training data into the thing, then the conclusion should be that the architecture is just not working out.


>But then we’ve given up on matching human intelligence which is all about working efficiently with small training data, and certainly training a human does not need anywhere near as much data as GPT-3.

I don't think we should really take so much inspiration from the brain. We didn't make airplanes work by building bird machines, so why should we do that here?

>GPT-3 was interesting as a proof-of-concept of what happens when you use a gigantic amount of training data. We don’t need a bigger one until we can figure out how to make a smaller one that is just as effective.

This feels like a non sequitur. We can certainly keep making larger models, and we will, because we can continue to make performance gains doing so.

>If scaling laws are telling us to keep putting even more training data into the thing, then the conclusion should be that the architecture is just not working out.

I don't think anyone in the field would agree with this point. Researchers see an easy avenue to gain better performance, so they take it. DeepMind's model shows you can get similar results with a more refined approach, but this was released well after GPT-3. When teams significantly advance the state of the art with a much smaller model I think we should take notice, but that hasn't happened yet.


> I don't think we should really take so much inspiration from the brain. We didn't make airplanes work by building bird machines so why should we do that here.

It’s not that we should mimic the brain’s implementation, but we should certainly strive to match the brain’s capabilities. One of its outwardly observable capabilities is that it is extremely efficient in the size of the training data set it requires.

Efficiency isn’t an implementation detail, it’s definitional to what “highly intelligent” means.

GPT-3 is not an airplane, it’s a zeppelin. Zeppelins also have scaling laws dictating that a zeppelin should be very very large. Building bigger and bigger zeppelins is one thing, justifying expending resources on gigantic zeppelins by stating the scaling law and concluding that a jet aircraft will magically pop out if you build a big enough zeppelin is quite another.


Your earlier analogy kind of feels like saying that because you can go further by adding more fuel to a jet engine's fuel tank, you have failed at efficiency and should redesign the engine.

But generally I think the better analogy is a rocket ship. If we can still go higher and faster with more fuel we should try to do that before we worry about engine efficiency. You have to get to the moon before you can colonize the galaxy.


> It's clear that these models have orders of magnitude too much data already.

I have a toy disproof for your claim that this is clear.

Imagine that you are training a ML system using oracle access to Mum. The ML training system can request 10 million representative samples of Mum output, and then we could judge if the ML system has adequately reproduced Mum.

Now also imagine that Mum frequently tells people that she knows a 23-letter secret, and while Mum won't tell people what it is outright, she'll answer queries like whether a guess is lexicographically higher or lower. We could even imagine that the ML has seen Mum's side of some interactions with her doing that.

Would the ML know Mum's secret? No.

Would a child that could interact with Mum? Yes -- after at most ceil(23 * log2(alphabet size)) queries, if the child is efficient.

Learning in an interactive context is not the same as learning from written material, so you can't be sure that the fact that children learn English from less text means that a non-interactive ML system could learn English from the same amount. Q.E.D.

Now, if someone figures out how to efficiently train these natural language models with reinforcement learning...


I disagree with this take, because you grok English not only from the text you read, but also from the context of the physical world around you. And that context is enormous: assuming 8000x8000x2 vision with three 1-byte color channels at 24 fps without compression, you get 3e+17 bytes (300 petabytes) of data along with your reading per year.


Blind children can learn English fine, though. And there are highly immaterial areas (mathematics) which people still reason about.


You ignored the point. I only brought up sight as an example (though, admittedly, it is the largest data inflow).


Humans have to learn sight while learning speech though. I don't mean knowing that this is a dog, I mean making sense of noisy vision inputs.


So what? There's no indication that one hinders the other much, and may even improve generalization.


On the other hand, consider the difficulty of taking massive amounts of data from the modern web and filtering out the subset that was actually generated by humans, rather than previous generations of language models.


Definitely an interesting future problem. I'm sure OpenAI and others are thinking about it but I don't think these models are ubiquitous enough to have much impact just yet.


Some estimates:

- 500M tweets per day

- 30 words/tokens per tweet

- 40% of all tweets thrown away due to being duplicate/spam/bots

= 9B tokens generated per day


I understand I can query such a model one query at a time. But is there a way to query these models with several queries in a row, such that the (N+1)-th query benefits from the knowledge that was used to answer the first N questions? Basically, following a conversation. For example, YouTube subtitles can badly translate some terms, but if "it" had in mind the overall subject of the video, then it'd probably pick the correct word...


Yes. That's how you use GPT-3: for the 2nd token, you feed in your prompt and the first token it returned. Then you feed it your prompt and the first two output tokens, and so on.
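
In other words, the conversation is just part of an ever-growing prompt. A minimal sketch of that loop (the next_token function here is hypothetical, standing in for one forward pass of the model plus sampling):

    def next_token(context: list[str]) -> str:
        """Hypothetical stand-in for one model forward pass plus sampling."""
        raise NotImplementedError

    def generate(prompt: list[str], n_tokens: int) -> list[str]:
        context = list(prompt)
        for _ in range(n_tokens):
            # Token N+1 is conditioned on the prompt plus all N tokens produced so far,
            # which is what lets later answers "remember" earlier ones (up to the context limit).
            context.append(next_token(context))
        return context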


the term for this is autoregressive models iirc


I'd love to take a language model, load it up, and train it on everything I write in online learning mode. Does one need some massive hardware to do online learning with these models instead of just running the distilled final models?


I have to ask, why call it that? I had a chuckle once I saw the name.


It's not so bad. If they were radio astronomers they'd call it Very Big Neuronal Language Model. IBM would call it Watson Advanced AI. If they were a gamer accessory company they'd call it DeepTek Ultra Pro VDH-Max AI A320M. Chinchilla is nice and fluffy.


Large language models have a (recent) history of silly names. BERT, BART, ELMO, RoBERTa, BIGBIRD, PaLM, Megatron etc. Might as well go full nonsense.


My theory is that since no one reads literature anymore, timeless, interesting, and unique names from history and other cultures are lost to a deluge of soon-to-be-forgotten gag, pop-culture, and meme names. Perhaps this is why we have Chinchilla and not Oberon.


Like the Oberon OS and programming language?


Image models too - the Inception paper from 2014 directly refers to knowyourmeme.com and the "we need to go deeper" meme from the movie Inception - https://knowyourmeme.com/memes/we-need-to-go-deeper - it's the first reference in the paper [1] and it's also why the model is called that way.

[1] https://arxiv.org/pdf/1409.4842.pdf


A touch of irony that cutting edge research on language can’t produce better names.


True. I will add that it is customary to justify it by demonstrating it is some sort of acronym or contraction.


It's a recursive, selective acronym

               C
              CH
             CHI
            CHIN
           CHINC
          CHINCH
         CHINCHI
        CHINCHIL
       CHINCHILL
  ==> CHINCHILLA
      HINCHILLA
      INCHILLA
      NCHILLA
      CHILLA
      HILLA
      ILLA
      LLA
      LA
      A


I know what recursive means, I know what selective means, I know what an acronym is, and I think I see the pattern in that picture, but when I put it all together I am lost.

Alternatively, is this a joke and the "recursive, selective acronym" can be used to justify any word?


               A
              AR
             ARB
            ARBI
           ARBIT
          ARBITR
         ARBITRA
        ARBITRAR
  ==>  ARBITRARY
       RBITRARY
       BITRARY
       ITRARY
       TRARY
       RARY
       ARY
       RY
       Y


Yup, seems it works for any word.


There were a lot of complaints about earlier models being named, say, 'Meena'. (It's very sexist, you know, to name a chatbot a female name.) People won't complain about 'Chinchilla' because chinchillas are adorable. PaLMs aren't adorable, but at least it's neutral.


It outperforms the Gopher model


Yet oddly enough I don't think chinchillas are smaller than gophers


Yeah, similar "thematic" naming to MacOS versions. I don't know why the original one was called Gopher, though.


Because it retrieves facts from memory in a way that’s analogized to a gopher retrieving objects.


It's the name of a town in QLD.


Can't wait for DeepMind to take a stab at outcompeting dall-e.



