If not, you're getting diminishing returns for each parameter you add.
In extreme cases your model can even perform worse with more parameters, due to lack of training data.
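To put a rough number on that tradeoff, here's a minimal sketch using the Chinchilla-style heuristic of roughly 20 training tokens per parameter; the dataset size below is a made-up example, not a recommendation:

```python
# Rough sketch of the params-vs-data tradeoff, using the Chinchilla
# heuristic of ~20 training tokens per parameter as compute-optimal.
# The dataset size is an assumption for illustration only.

def compute_optimal_params(num_tokens: float, tokens_per_param: float = 20.0) -> float:
    """Estimate a sane parameter count for a fixed token budget."""
    return num_tokens / tokens_per_param

dataset_tokens = 300e9  # hypothetical 300B-token dataset
print(f"~{compute_optimal_params(dataset_tokens) / 1e9:.0f}B params is a sane target")
# Pushing the param count far past that without adding data is where the
# diminishing (or even negative) returns kick in.
```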
To make it more complicated, the quality of the data matters as well.
So there are two major directions: build efficient models with a good dataset and an optimal parameter count for the task.
Or go big on everything (the OpenAI approach), which requires monster GPU time for every reply token.
There are obviously in-between approaches as well, hence why the question is so loaded.
Ballpark: if you're not setting aside $100k for GPUs alone to train a 60B model from scratch, you're probably not ready to train one.
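A quick back-of-envelope sketch of why the floor is that high, using the common 6 × params × tokens FLOPs rule of thumb; the GPU throughput, utilization, and hourly price are assumptions, not quotes:

```python
# Back-of-envelope training cost for a 60B model.
# All hardware and pricing figures below are assumptions for illustration.

params = 60e9
tokens = 20 * params                 # Chinchilla-style token budget (~1.2T tokens)
flops = 6 * params * tokens          # ~6 * N * D rule of thumb for training FLOPs

gpu_flops = 300e12                   # assumed ~300 TFLOP/s sustained per GPU after utilization losses
gpu_hour_cost = 2.0                  # assumed $ per GPU-hour

gpu_hours = flops / gpu_flops / 3600
print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * gpu_hour_cost:,.0f} in compute")
```

Under those assumptions the number lands well past the $100k floor, and that's before you count failed runs, data prep, or storage.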