Look, I'm not going to say the transformer is as efficient as the brain, but you are not starting from zero.
Any code LLM will be learning language, code, and everything else with absolutely no predisposition to any of it.
Your brain is baked with millions of years of evolution with specific areas already predisposed to certain types of processing before you ever utter a word.
The training process finds local minima starting from an initialization vector of random numbers. Millions of years of evolution probably mean you were initialized better than a baby AI seeded by a pseudo-random number generator.
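In toy form (the loss surface, learning rate, and step count here are all made up just to illustrate the point):

```python
# Minimal sketch: training descends from a random starting point into some
# nearby local minimum. A "better initialization" just means a better starting
# point on the same loss surface.
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # toy non-convex loss with many local minima
    return np.sin(3 * w) + 0.1 * w**2

def grad(w, eps=1e-5):
    # numerical gradient, good enough for a one-dimensional toy
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

w = rng.normal()              # the "pseudo-random initialization vector"
for _ in range(200):          # plain gradient descent
    w -= 0.05 * grad(w)
print(f"settled at w={w:.3f}, loss={loss(w):.3f}")  # whichever minimum was nearby
```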
Watching my kid learn to speak made it clear to me that we are pre-wired for the acquisition of language: we're not building the structures from scratch, we're learning to put words into the empty slots. Probably natural selection at work: no speech, lower chance of gene propagation.
> Your brain is baked with millions of years of evolution
Exactly, and not only that, we are agents from birth, so we enjoy the 5 E's: "Embodied, Embedded, Enacted and Extended into the Environment"
LLMs can't even run a script they wrote to see if it works; they get no feedback and can't make any meaningful choice with consequences. They are trained with "teacher forcing" and self-supervised objectives, with no deviation allowed.
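For reference, "teacher forcing" in next-token training looks roughly like this; a minimal PyTorch-style sketch, where `model` is assumed to return logits over the vocabulary:

```python
import torch
import torch.nn.functional as F

def teacher_forced_step(model, tokens):
    # tokens: (batch, seq_len) of token ids from the static training corpus
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len-1, vocab)
    # The model is always conditioned on the ground-truth prefix, never on its
    # own previous outputs; that's the "no deviation allowed" part.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    loss.backward()
    return loss.item()
```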
On the other hand, we've got THE WORLD, which is infinitely richer than any simulation, plus human society, which is made up of super-GPT agents, and search-based access to information.
Remember, most LLMs work closed-book and train on a dry, static dataset. Don't directly compare them with humans. Humans can't write great code top-down without computers either. We are trial-and-feedback monkeys; without feedback we're just no good.
>Look, I'm not going to say the transformer is as efficient as the brain, but you are not starting from zero.
Still, we can rewrite the parent's argument as:
If we trained an AI on the amount of non-code-related data (writing read, speech heard) I've consumed, and then added all the code-related writing and speech (coding books, coding lessons taken, code and manual pages read, man pages, etc.) I've consumed, would it be even remotely as good at coding as me? Or even as good as it is now?
I'd guess no. It's less efficient, and thus needs a far bigger coding dataset than a human does to get the point of coding. Which brings us to:
>Your brain is baked with millions of years of evolution with specific areas already predisposed to certain types of processing before you ever utter a word.
Isn't that the whole point the parent is making?
That our advantage isn't about dataset volume, but about architecture.
The closest biological equivalent to a parameter in an ANN is a synapse. Well humans have about 100 trillion synapses. We already know that the higher the parameter count, the lower the training data required: a 50 billion parameter model will far outperform a 5 billion one trained on the same data, and a 500b one would far outperform that 50 billion one.
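As a rough illustration (not a claim about any particular model), plugging different sizes into the parametric scaling law fitted in the Chinchilla paper (Hoffmann et al. 2022) shows the token count needed to hit a fixed target loss shrinking as parameters grow; the constants below are the paper's fitted values, and the target loss of 2.1 is arbitrary:

```python
# L(N, D) = E + A / N^alpha + B / D^beta, with the fitted Chinchilla constants.
# Treat the resulting numbers as illustrative only.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_needed(N, target_loss):
    """Tokens D needed for an N-parameter model to reach target_loss."""
    residual = target_loss - E - A / N**alpha
    if residual <= 0:
        return float("inf")  # this model size can't reach the target at all
    return (B / residual) ** (1 / beta)

for N in (5e9, 50e9, 500e9):
    print(f"{N/1e9:>4.0f}B params -> ~{tokens_needed(N, 2.1):.3g} tokens to hit loss 2.1")
# Bigger models need fewer tokens to reach the same loss under this fit.
```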
Current LLMs are actually nowhere near the scale of the human brain, either in parameters/neurons or in training data (all the text we've ever trained an LLM on would be dwarfed by all the sense data humans perceive), and they don't have the head start the human brain has.
It's a bogus comparison when you really think about it.
You could easily make the case that LLMs are far more efficient.
It is a bogus comparison, because typical language models take in textual representations and output textual representations, which is a very small fraction of "the brain". An astounding portion of the brain's neurons really go toward proprioception, smell, taste, motor function, etc., which are not even slightly part of most models today.
Wernicke's area, a small sliver of frontal cortex, and a dash of dopaminergic circuitry are maybe the best "coverage" made in these models, if you want to be exceedingly facile/ham-handed about the analogies here. That's a very small portion of cortex, and much closer than you may think in terms of capability/TEPS and unit count.
I certainly recognize that possibility; but I also realize that systems can be extremely useful, and have a great "understanding" (for some definition of "understanding") of linguistic and visual data, without any need for "sentience", "consciousness", or any other completely ill-defined ideas anyone wants to throw around.
There are a few cases where sensory coverage beyond visual, audio, and linguistic processing (the main systems every decent AI already has as inputs, and a very small fraction of the brain) would be very helpful, but clearly not absolutely necessary, in improving the capability of a world model. For example, knowing that a metal container half full of water will slosh differently than a full or empty one requires proprioception and motor skills as well as visual input. Cases like this will be slightly less performant, but they're typically not relevant for the tasks we're interested in automating.
Yes, I don't think consciousness can exist without feedback and delay. You can still experience a motionless room, but that's because the previous "frames" are still bouncing around in your brain. If you remove the historical replay (delayed feedback), it's something else.
It's not clear how many classical calculations a single human neuron is equivalent to. There's a strong analog component in multiple domains (strength, frequency, timing), and each neuron can connect to up to 15,000 other neurons. Assuming the brain's neurons are (probably unrealistically) fairly "digital", we get an estimate of the human brain being equivalent to about 1 exaflop (the currently accepted lower bound, and rather disputed as being too low). TPUv4 pods currently provide approximately 9 exaflops, yet I don't think we're reaching human-level learning rates. There's no accepted "upper bound" on estimates of FLOP equivalency to a human brain.
> Though we have been building and programming computing machines for about 60 years and have learned a great deal about composition and abstraction, we have just begun to scratch the surface.
> A mammalian neuron takes about ten milliseconds to respond to a stimulus. A driver can respond to a visual stimulus in a few hundred milliseconds, and decide an action, such as making a turn. So the computational depth of this behavior is only a few tens of steps. We don't know how to make such a machine, and we wouldn't know how to program it.
> The human genome -- the information required to build a human from a single, undifferentiated eukaryotic cell -- is about 1GB. The instructions to build a mammal are written in very dense code, and the program is extremely flexible. Only small patches to the human genome are required to build a cow or a dog rather than a human. Bigger patches result in a frog or a snake. We don't have any idea how to make a description of such a complex machine that is both dense and flexible.
> New design principles and new linguistic support are needed. I will address this issue and show some ideas that can perhaps get us to the next phase of engineering design.
> Gerald Sussman, Massachusetts Institute of Technology
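Back-of-the-envelope for the two figures in that quote, taking the standard ~3.2 billion base-pair estimate for the human genome:

```python
# Computational depth: ~10 ms per neuron response, ~300 ms to react to a
# visual stimulus and decide an action.
neuron_step_ms = 10
reaction_ms = 300
print(reaction_ms // neuron_step_ms, "sequential steps, give or take")  # ~30

# Genome size: ~3.2 billion base pairs, 2 bits each (4 possible bases).
base_pairs = 3.2e9
gigabytes = base_pairs * 2 / 8 / 1e9
print(f"~{gigabytes:.1f} GB uncompressed")  # ~0.8 GB, i.e. "about 1GB"
```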
My understanding is that TEPS were used to estimate the computing for these types of operations, rather than FLOPS, as they were more useful specifically for that comparison. These metrics put them in the same order of magnitude; however, as stated before, they miss the point by quite a bit, since much of the "computation" humans do is quite irrelevant (taste, smell, etc.) to producing language or solving algorithmic problems.
For example, the cerebellum accounts for 50-80% of the figure people keep quoting here (the number of neurons in the brain) and is not much activated in language processing.
Wernicke's area spans just a few percent of the cortical neurons.
The amount of preprocessing we do by providing text is actually enormous, so that already removes a remarkable amount of complexity from the model. So, despite the differences between biology and ANNs, what we're seeing right now isn't unreasonable.
Look, this is great thinking... I don't want to diminish that, but think of a brain like an FPGA (parallel logic), not a synchronous chip with memory and fetch-decode-execute style steps.
We do things in a massively parallel way and that is why and how we can do things quickly and efficiently!
You run into the typical neural net problem with this logic. OpenAI (or at least Sam Altman) have already publicly acknowledged that the diminishing returns they're seeing in terms of model size are sufficient to effectively declare that 'the age of giant models is already over.' [1] It seems many people were unaware of his comments on this topic.
Neural networks in literally every other field always repeat the exact same pattern. You can get from 0-80 without breaking a sweat. 80-90 is dramatically harder, but you finally get there. So everybody imagines getting from 90-100 will be little more than a matter of a bit more compute and a bit more massaging of the model. But it turns out that each fraction of a percent of progress you make becomes exponentially more difficult, and you eventually run into an asymptote that's nowhere near what you are aiming for.
A prediction based on the typical history of neural nets would be that OpenAI will be able to continue to make progress on extremely specific metrics, like scoring well on some test or another, largely by hardcoding case-specific workarounds and tweaks. But in terms of general model usage, we're unlikely to see any real revolutionary leaps in the foreseeable future.
If we see model accuracy increase, I'd expect it to come not from model improvement but from something like adding a second layer where the software cross-references the generated output against a "fact database" and regenerates its answer when some correlation factor is insufficiently high. Of course, that would completely cripple the model's ability to ever move "beyond" its training. It'd be like if mankind was forced to double check that any response on astronomy we made confirmed that the Earth is indeed the center of the universe, with no ability to ever change that ourselves.
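Something like this hypothetical loop; every name here is made up purely for illustration:

```python
# Hypothetical sketch of the "second layer" described above: generate, check
# against a fact store, and regenerate if agreement is too low.
def answer_with_fact_check(prompt, generate, fact_score,
                           threshold=0.8, max_attempts=3):
    best, best_score = None, -1.0
    for _ in range(max_attempts):
        candidate = generate(prompt)      # the base model's output
        score = fact_score(candidate)     # agreement with the fact database
        if score >= threshold:
            return candidate
        if score > best_score:
            best, best_score = candidate, score
    return best  # nothing cleared the bar; return the least-bad attempt
```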
There is an argument that Altman's statement is just trying to distract competitors from outspending OpenAI. Prior to GPT-4 there were no indications of diminishing returns (at least on a log scale).
The tremendous progress over the last year makes me wary of your statement that progress will stop coming from model size improvements.
I don't think it is implausible. If engineers come to management at Google and ask for $4bn to do a six-month moonshot AI training run, then such a smoke-screen statement can be highly effective. Even if they delay their plans for 4 weeks to evaluate the scaling first, it is another 4 weeks of head start for OpenAI.
Also, not everyone can bring $500m or more to the table to train a big model in the first place.
> tremendous progress
There are things which just seem to scale and others which don't. So far, the returns from adding more data and more compute don't seem to flatten out that much.
At least we should give it another year to see where it leads us.
>Sam Altman) have already publicly acknowledged that the diminishing returns they're seeing in terms of model size are sufficient to effectively declare that 'the age of giant models is already over.'
He never said anything about technical diminishing returns. He's saying we're hitting a wall economically.
The Chief Scientist at OpenAI thinks there's plenty of ability left to squeeze out.
Economics wasn't hinted at or implied in any way. Diminishing returns on model size don't mean there's nothing left to squeeze out; they just mean the gains will come from model refinement, rather than from the Nvidia vision of a quadrillion-weight system that expects large, or even linear, gains from that hop up in model size.
> Well humans have about 100 trillion synapses. We already know that the higher the parameter count, the lower the training data required.
Do you have any reference to back this claim? It sounds very curious to me. My understanding was pretty much the opposite: that current LLM technology requires a bigger training set as you increase the parameter count. I'm no NN expert in any way, though.
>It also increases the cost by a lot, so it's not a no-brainer at all.
Okay? Parameter size increases also increase cost a lot, far more than more training data does, and those costs persist well beyond training. Training on 1T tokens vs 500B won't change how many resources it takes to run the model. That's not the case with parameter size.
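Using the usual rules of thumb (training cost of roughly 6 * params * tokens FLOPs, inference of roughly 2 * params FLOPs per generated token), the asymmetry looks like this; the model sizes below are arbitrary examples:

```python
# Order-of-magnitude illustration only: more data raises the one-time training
# cost, while more parameters raise both training AND every future token served.
def train_flops(params, tokens):
    return 6 * params * tokens

def infer_flops_per_token(params):
    return 2 * params

scenarios = [
    (13e9, 1.0e12),   # baseline: 13B params on 1T tokens
    (13e9, 2.0e12),   # double the data: training doubles, serving unchanged
    (65e9, 1.0e12),   # 5x the params: training and serving both ~5x
]
for params, tokens in scenarios:
    print(f"{params/1e9:>4.0f}B / {tokens/1e12:.0f}T tokens: "
          f"train ~{train_flops(params, tokens):.1e} FLOPs, "
          f"serve ~{infer_flops_per_token(params):.1e} FLOPs/token")
```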
>If they could beat the state of the art with only a fraction of the training cost, I suspect that they'd do so…
Not sure what this has to do with anything lol
>This is the claim you're making, but it's not substantiated.
I'm sorry but can you perhaps just read the paper sent?
Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.
> Okay? Parameter size increases also increase cost a lot, far more than more training data does.
Yup, and that's why lots of work goes into smaller models trained beyond Chinchilla optimality. But increasing the model size alone doesn't seem to make sense to anyone for some reason.
> I'm sorry but can you perhaps just read the paper sent?
I did skim it, and it's not making the claim you are.
> Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.
This has nothing to do with your claim that “We already know that the higher the parameter count, the lower the training data required”. To back such a claim we'd need a 540b model trained on 10b tokens beating or rivaling an 8b-parameter model trained on 400b. I'm not aware of anything like this existing today.
That a big model trained with enough data can beat a smaller model on the same data isn't the same claim at all.
>But increasing the model size alone doesn't seem to make sense to anyone for some reason.
It's not economically viable or efficient to just scale model size.
>This has nothing to do with your claim that “We already know that the higher the parameter count, the lower the training data required”. To back such a claim we'd need a 540b model trained on 10b tokens beating or rivaling an 8b-parameter model trained on 400b. I'm not aware of anything like this existing today.
That's literally what I said:
>a 50 billion parameter model will far outperform a 5 billion one TRAINED ON THE SAME DATA.
A 400b-token dataset is not the same training data as a 10b-token dataset.
> We already know that the higher the parameter count, the lower the training data required
And if you scroll up a bit, you'll see that this was the assertion that I've been questioning since the beginning.
Also, even this other assertion
> a 50 billion parameter model will far outperform a 5 billion one TRAINED ON THE SAME DATA.
is unsupported in the general case: would it hold if both were trained on 10b tokens? They'd both be fairly under-trained, but I suspect the performance of the bigger model would suffer more than the smaller one's.
AFAIK, there's no reason to believe that the current LLM architecture scaled to 100 trillion parameters could be trained efficiently on just a few million tokens like humans, and the paper you quoted sure isn't backing that original argument of yours.
> We already know that the higher the parameter count, the lower the training data required
>And if you scroll up a bit, you'll see that this was the assertion that I've been questioning since the beginning.
They follow each other. If you have a target in mind, it's the same thing in different words.
>AFAIK, there's no reason to believe that the current LLM architecture scaled to 100 trillion parameters could be trained efficiently on just a few million tokens like humans
I didn't say it was a given. And in my original comment, I say as much.
Also, object recognition leads to abstraction, and motion perception to causality. Proprioception is a big part of human reasoning. We're not trained on only millions of tokens, and our objective function(s) are different.
>Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.
That's quite a small sample to argue the generic point that "for any arbitrary performance x, the data required to reach it reduces with size".
That paper does show evidence of diminishing returns, for what it’s worth. You get less going from 62b to 540b than you do from 8b to 62b. Combined with the increased costs of training gargantuan models, it’s not clear to me that models with trillions of parameters will really be worth it.
>a 50 billion parameter model will far outperform a 5 billion one trained on the same data, and a 500b one would far outperform that 50 billion one.
I'm not so sure. I'm pretty sure there are diminishing returns at play after some point.
Plus, haven't we already seen models with far fewer parameters perform the same as or very close to ChatGPT, which had a much higher count (Llama and its siblings)?
>a 50 billion parameter model will far outperform a 5 billion one trained on the same data, and a 500b one would far outperform that 50 billion one.
>I'm not so sure. I'm pretty sure there are diminishing returns at play after some point.
We can speculate about just how far this scaling can go, or how far it even needs to go, but everything I said there is true. We have models trained and evaluated at all those sizes.
>Plus, haven't we already seen models with far fewer parameters perform the same as or very close to ChatGPT, which had a much higher count (Llama and its siblings)?
Only by training on far more data. Llama 13b has to be trained on over 3x more data just to reach the original GPT-3 model from 2020 (not 3.5).
>We can speculate about just how far this scaling can go, or how far it even needs to go, but everything I said there is true. We have models trained and evaluated at all those sizes.
The part about "far outperforming", which is the main claim, is wrong though. We've seen much smaller models developed that fare quite well against, and are even competitive with, the larger ones.
You already said "only by training on far more data", which is different than "more parameters" being the only option.
>You already said "only by training on far more data", which is different than "more parameters" being the only option.
I never said more parameters was the only way to increase performance. I said the training data required to reach any arbitrary performance x reduces with parameter size.
It's literally right there in what I wrote.
>a 50 billion parameter model will far outperform a 5 billion one TRAINED ON THE SAME DATA.
This is why all the Alpha%wildcard% projects use self-play: that, along with a fitness function or score maximizer, generates their training data rather than having to show them records of games. You just let it play the game.
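Schematically, and with placeholder names rather than anything from the actual AlphaZero codebase:

```python
# Minimal sketch of self-play data generation. `game` and `choose_move` are
# placeholders for an environment and the current policy.
def self_play_episode(game, choose_move):
    """Play one game against ourselves and record (state, move) pairs."""
    trajectory = []
    while not game.is_over():
        move = choose_move(game)              # current policy picks a move
        trajectory.append((game.state(), move))
        game.play(move)
    outcome = game.score()                    # the "fitness function" / score maximizer
    # Each recorded position is labeled with the final outcome, so the training
    # data comes from play itself rather than from records of human games.
    return [(state, move, outcome) for state, move in trajectory]
```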
Not all of them. AlphaStar for StarCraft was largely supervised learning, to mimic top human players. Once it was sufficiently mimicking existing replays, it switched to reinforcement learning.
That's why you see it do "nonsensical" things like destroying the "unbuildable plates" at the bottom of its natural ramp. This 100% wouldn't happen if it had only learned through self-play.
That’s an arbitrary line to draw, though. The human brain starts with a general architecture that has gone through billions and trillions of generations of evolutionary training, before being fine-tuned in a single individual over decades; and then you do a little bit of fine-tuning and few-shot prompting at the end and claim that's comparable to the entire training of an LLM from scratch? Not to mention the many more orders of magnitude of neurons that a human brain has. I could equally argue that an LLM takes zero training, since we have ALREADY trained models and I can just copy the model, run it, and get a new “brain” performing a task, instead of taking decades of training to get there.
Even your statement about programming skills is debatable; it depends on how you measure programming skill. They certainly are faster at it, and they know more computer languages than most people have even heard of. In fact, human programming strength seems to be more about general logic and planning skills than programming-specific skills, both of which got the bulk of their training evolutionarily and, more generally, over the course of a life.
The truth is, the two are not directly comparable. They are completely different architectures, at completely different scales, with entirely different strengths and weaknesses.
I feel like there should be a basic "curriculum" that gets passed to all foundational LLMs and teaches them the basics of language. Maybe 100 million files, where the first 10 million are all first-grade reading level, the second 10 million are all second-grade reading level, etc.
Ideally this includes a bunch of textbooks. That should give the LLM time to grok language before it starts training on more difficult texts.
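A sketch of what that ordering could look like, using the standard Flesch-Kincaid grade formula as a stand-in for whatever readability metric you'd actually trust:

```python
import re

def flesch_kincaid_grade(text):
    """Rough readability estimate (standard Flesch-Kincaid grade formula,
    with a crude vowel-group syllable count)."""
    words = text.split()
    sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)
    syllables = sum(len(re.findall(r"[aeiouy]+", w.lower())) or 1 for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

def curriculum_order(documents):
    """Sort documents from roughly first-grade reading level upward."""
    return sorted(documents, key=flesch_kincaid_grade)
```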