>It also increases the cost by a lot, so it's not a no-brainer at all.
Okay?.. Parameter size increases also increase cost a lot. Far more than more training data. And those are costs that persist well beyond training: training on 1T tokens vs 500b won't change how many resources it takes to run the model. That's not the case with parameter size.
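Rough back-of-the-envelope, if it helps, using the usual approximations (training ≈ 6·N·D FLOPs, inference ≈ 2·N FLOPs per generated token); the model sizes in it are made-up examples:

```python
# Rough cost sketch using the standard approximations
# (training ~= 6 * params * tokens FLOPs, inference ~= 2 * params FLOPs per token).
# The 70b/140b model sizes below are made-up examples for illustration.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute."""
    return 6 * n_params * n_tokens

def infer_flops_per_token(n_params: float) -> float:
    """Approximate serving compute per generated token."""
    return 2 * n_params

# Doubling the training data (500b -> 1T tokens) doubles the one-time training
# bill but leaves serving cost untouched:
print(f"{train_flops(70e9, 0.5e12):.1e}")    # ~2.1e+23 FLOPs
print(f"{train_flops(70e9, 1.0e12):.1e}")    # ~4.2e+23 FLOPs
print(f"{infer_flops_per_token(70e9):.1e}")  # ~1.4e+11 FLOPs/token either way

# Doubling the parameter count doubles both the training bill *and* the
# per-token serving cost, for the life of the model:
print(f"{infer_flops_per_token(140e9):.1e}") # ~2.8e+11 FLOPs/token
```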
>If they could beat the state of the art with only a fraction of the training cost, I suspect that they'd do so…
Not sure what this has to do with anything lol
>This is the claim you're making, but it's not substantiated.
I'm sorry, but can you perhaps just read the paper I sent?
Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.
> Okay?.. Parameter size increases also increase cost a lot. Far more than more training data.
Yup, and that's why lots of work goes into smaller models trained beyond Chinchilla optimality. But increasing the model size alone doesn't seem to make sense to anyone, for some reason.
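For anyone following along, "Chinchilla optimality" here is roughly the rule of thumb from Hoffmann et al. (2022) that compute-optimal training uses on the order of 20 tokens per parameter; a minimal sketch, with that 20x ratio taken as an assumption:

```python
# Sketch of the Chinchilla rule of thumb: compute-optimal training uses
# roughly 20 tokens per parameter (Hoffmann et al., 2022). The exact ratio
# is an approximation, used here only to show the scale of data involved.

TOKENS_PER_PARAM = 20  # assumed compute-optimal ratio

def chinchilla_optimal_tokens(n_params: float) -> float:
    return TOKENS_PER_PARAM * n_params

for n in (8e9, 62e9, 540e9):
    print(f"{n/1e9:.0f}b params -> ~{chinchilla_optimal_tokens(n)/1e12:.2f}T tokens")
# 8b   -> ~0.16T tokens
# 62b  -> ~1.24T tokens
# 540b -> ~10.80T tokens
#
# "Trained beyond Chinchilla optimality" means pushing a small model well past
# its ~20x ratio, paying extra training compute for a model that's cheap to serve.
```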
> I'm sorry but can you perhaps just read the paper sent?
I did skim it, and it's not making the claim you are.
> Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.
This has nothing to do with your claim that “We already know that the higher the parameter count, the lower the training data required”. To back such a claim we'd need a 540b model trained on 10b tokens beating / rivaling an 8b-parameter model trained on 400b tokens. I'm not aware of anything like this existing today.
That a big model trained with enough data can beat a smaller model on the same data isn't the same claim at all.
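One way to put rough numbers on that hypothetical is the Chinchilla parametric loss fit, L(N, D) = E + A/N^α + B/D^β, with the constants reported by Hoffmann et al.; extrapolating it this far outside the fitted regime is an assumption, so take it as illustrative only:

```python
# Chinchilla parametric loss fit, L(N, D) = E + A / N**alpha + B / D**beta,
# with the constants reported by Hoffmann et al. (2022). Extrapolating it to
# the hypothetical pairing above is an assumption, not a measurement.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA

print(f"{predicted_loss(540e9, 10e9):.2f}")  # ~2.38 (540b params, 10b tokens)
print(f"{predicted_loss(8e9, 400e9):.2f}")   # ~2.10 (8b params, 400b tokens)
# Under this fit, the data-starved 540b model is predicted to lose to the
# 8b model trained on 40x more data.
```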
>But increasing the model size alone doesn't seem to make sense to anyone for some reason.
It's not economically viable or efficient to just scale model size.
>This has nothing to do with your claim that “We already know that the higher the parameter count, the lower the training data required”. To back such a claim we'd need a 540b model trained on 10b tokens beating / rivaling an 8b-parameter model trained on 400b tokens. I'm not aware of anything like this existing today.
That's literally what I said:
>a 50 billion parameter model will far outperform a 5 billion one TRAINED ON THE SAME DATA.
A 400b-token dataset is not the same training data as a 10b-token one.
> We already know that the higher the parameter count, the lower the training data required
And if you scroll up a bit, you'll see that this was the assertion that I've been questioning since the beginning.
Also, even this other assertion
> a 50 billion parameter model will far outperform a 5 billion one TRAINED ON THE SAME DATA.
is unsupported in the general case: would it hold if both were trained on 10b tokens? They'd both be fairly under-trained, but I suspect the performance of the bigger model would suffer more than the smaller one's.
AFAIK, there's no reason to believe that the current LLM architecture, scaled to 100 trillion parameters, could be trained efficiently on just a few million tokens like humans are, and the paper you quoted sure isn't backing this original argument of yours.
> We already know that the higher the parameter count, the lower the training data required
>And if you scroll up a bit, you'll see that this was the assertion that I've been questioning since the beginning.
One follows from the other. If you have a target performance in mind, it's the same thing in different words.
>AFAIK, there's no reason to believe that the current LLM architecture, scaled to 100 trillion parameters, could be trained efficiently on just a few million tokens like humans are
I didn't say it was a given, and in my original comment I say as much.
Also, object recognition leads to abstraction, motion perception to causality, and proprioception is a big part of human reasoning. We're not trained on only millions of tokens. And our objective function(s) are different.
>Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.
That's quite a small sample to argue the generic point that "for any arbitrary performance x, the data required to reach it reduces with size".
That paper does show evidence of diminishing returns, for what it’s worth. You get less going from 62b to 540b than you do from 8b to 62b. Combined with the increased costs of training gargantuan models, it’s not clear to me that models with trillions of parameters will really be worth it.
It also increases the cost by a lot, so it's not a no-brainer at all.
If they could beat the state of the art with only a fraction of the training cost, I suspect that they'd do so…
> The point is that for any arbitrary performance x, the data required to reach it reduces with size.
This is the claim you're making, but it's not substantiated.