If not, you're getting diminishing returns for each parameter you add.
In extreme cases your model can even perform worse with more parameters, due to lack of training data.
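To put a rough number on that tradeoff, here's a minimal sketch using the Chinchilla-style heuristic of roughly 20 training tokens per parameter; the dataset size below is a made-up example, not a recommendation:

```python
# Rough sketch of the params-vs-data tradeoff, using the Chinchilla
# heuristic of ~20 training tokens per parameter as compute-optimal.
# The dataset size is an assumption for illustration only.

def compute_optimal_params(num_tokens: float, tokens_per_param: float = 20.0) -> float:
    """Estimate a sane parameter count for a fixed token budget."""
    return num_tokens / tokens_per_param

dataset_tokens = 300e9  # hypothetical 300B-token dataset
print(f"~{compute_optimal_params(dataset_tokens) / 1e9:.0f}B params is a sane target")
# Pushing the param count far past that without adding data is where the
# diminishing (or even negative) returns kick in.
```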
To make it more complicated, the quality of the data matters as well.
So there are two major directions: build efficient models with a good dataset and an optimal parameter count for the task.
Or go big on everything (the OpenAI approach), which requires monster GPU time for every reply token.
There are obviously in-between approaches as well, hence why the question is so loaded.
Ballpark: if you're not setting aside $100k for GPUs alone to train a 60B model from scratch, you're probably not ready to train one.
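A quick back-of-envelope sketch of why the floor is that high, using the common 6 × params × tokens FLOPs rule of thumb; the GPU throughput, utilization, and hourly price are assumptions, not quotes:

```python
# Back-of-envelope training cost for a 60B model.
# All hardware and pricing figures below are assumptions for illustration.

params = 60e9
tokens = 20 * params                 # Chinchilla-style token budget (~1.2T tokens)
flops = 6 * params * tokens          # ~6 * N * D rule of thumb for training FLOPs

gpu_flops = 300e12                   # assumed ~300 TFLOP/s sustained per GPU after utilization losses
gpu_hour_cost = 2.0                  # assumed $ per GPU-hour

gpu_hours = flops / gpu_flops / 3600
print(f"~{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * gpu_hour_cost:,.0f} in compute")
```

Under those assumptions the number lands well past the $100k floor, and that's before you count failed runs, data prep, or storage.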