>It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100.
They are training the model on 3000/22 ≈ 136 times the chinchilla-optimal amount of data. It will be interesting to see how much the model keeps improving that far beyond the chinchilla point.
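For what it's worth, here's a quick back-of-the-envelope sketch of where those numbers come from. It assumes the rough ~20 tokens-per-parameter rule of thumb people take from the Chinchilla paper, so treat it as an approximation rather than the paper's exact fit:

    # Rough chinchilla-optimal token budget, using the ~20 tokens/param heuristic
    params = 1.1e9                    # TinyLlama parameter count
    tokens_per_param = 20             # approximate Chinchilla rule of thumb
    optimal_tokens = params * tokens_per_param
    print(f"chinchilla-optimal tokens: {optimal_tokens / 1e9:.0f}B")        # ~22B

    actual_tokens = 3e12              # the planned 3T-token run
    print(f"overtraining factor: {actual_tokens / optimal_tokens:.0f}x")    # ~136x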
I now come to understand that the technobabble in Star Trek wasn't that well predicted; in the future we will not be reversing polarities by aligning field cores. Picard will have us align our llamas with chihuahuas to get an alpacafied chinchilla model.
Chinchilla predicts that you could get lower loss by training a larger model with that amount of data. But the model size in this case was chosen for other reasons, mostly speed of inference and cost of fine-tuning. So it's just irrelevant here.
Well, it's relevant if you want to compare this parameter-bound model against one trained chinchilla-optimally with the same amount of compute, to see how much you're trading away.
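To put a rough number on that trade, here is a hedged sketch using the parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta from the Chinchilla paper, with the published Approach-3 constants and the standard C ≈ 6·N·D compute approximation. The comparison itself is just a back-of-the-envelope estimate, not anything the TinyLlama authors have claimed:

    # Chinchilla parametric loss fit, L(N, D) = E + A/N^alpha + B/D^beta
    # (constants as reported for Approach 3 in Hoffmann et al. 2022)
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    # TinyLlama's parameter-bound run: 1.1B params on 3T tokens
    n_tiny, d_tiny = 1.1e9, 3e12
    compute = 6 * n_tiny * d_tiny          # C ~= 6*N*D training FLOPs

    # Compute-matched chinchilla-optimal run (~20 tokens/param heuristic):
    # 6 * N * (20 * N) = C  =>  N = sqrt(C / 120)
    n_opt = (compute / 120) ** 0.5
    d_opt = 20 * n_opt

    print(f"parameter-bound: N={n_tiny:.2e}, D={d_tiny:.2e}, loss ~ {loss(n_tiny, d_tiny):.2f}")
    print(f"compute-optimal: N={n_opt:.2e}, D={d_opt:.2e}, loss ~ {loss(n_opt, d_opt):.2f}")

With those constants the two come out around 2.16 vs 2.10, i.e. under this fit the parameter-bound run gives up a few hundredths of loss in exchange for a model that is much cheaper to serve and fine-tune.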