
>It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100.

They are training the model on 3000/22 ≈ 136 times the Chinchilla-optimal token count. It will be interesting to see how much it improves when pushed that far beyond the optimum.
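
For reference, here is a back-of-the-envelope version of that arithmetic (a minimal sketch, assuming the roughly 20-tokens-per-parameter rule of thumb from the Chinchilla paper and the 3T-token run implied by the 3000/22 figure):

    # Rough check of the ratio above, assuming ~20 tokens per parameter
    # (Chinchilla rule of thumb) and a 3T-token training run.
    params = 1.1e9                    # TinyLlama parameter count
    chinchilla_tokens = 20 * params   # ~22B tokens, as in the quote
    total_tokens = 3e12               # 3T tokens implied by "3000/22"

    print(f"Chinchilla-optimal tokens: {chinchilla_tokens / 1e9:.0f}B")
    print(f"Overtraining ratio: {total_tokens / chinchilla_tokens:.0f}x")  # ~136x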





Very interesting, thanks for sharing!


I now come to understand that the technobabble in Star Trek wasn't that well predicted; in the future we will not be reversing polarities by aligning field cores. Picard will have us align our llamas with chihuahuas to get an alpacafied chinchilla model.


Lora and Alpaca at Tanagra.


Llama, when the loss fell.


From this episode, if I'm not mistaken: https://en.m.wikipedia.org/wiki/Darmok

I watched that series so many times…


Hence my username.


There should also be a tribble in there, somewhere.


Chinchilla predicts that you could get lower loss by training a larger model with that amount of data. But the model size in this case was chosen for other reasons, mostly speed of inference and cost of fine-tuning. So it's just irrelevant here.


Well, it's relevant if you want to compare this parameter-bound model against one trained compute-optimally with the same compute budget, to see how much you're trading away.
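
To make that comparison concrete, here is a rough sketch (assuming the common C ≈ 6·N·D FLOP estimate and the ~20 tokens-per-parameter rule of thumb; the numbers are only a ballpark):

    import math

    # Compute-matched comparison: parameter-bound TinyLlama vs. a
    # compute-optimal model under the ~20 tokens/parameter rule of thumb.
    N = 1.1e9          # TinyLlama parameters (parameter-bound choice)
    D = 3e12           # tokens it is trained on
    C = 6 * N * D      # approximate training compute in FLOPs

    # Compute-optimal config for the same budget: D_opt ~ 20 * N_opt,
    # so C = 6 * N_opt * (20 * N_opt) = 120 * N_opt**2.
    N_opt = math.sqrt(C / 120)
    D_opt = 20 * N_opt

    print(f"Compute budget: {C:.2e} FLOPs")
    print(f"Compute-optimal for that budget: ~{N_opt / 1e9:.0f}B params on ~{D_opt / 1e9:.0f}B tokens")

Under those assumptions the same budget would buy roughly a 13B-parameter model trained on ~260B tokens, which is the trade-off being discussed.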


It's a bit amusing how people treat the Chinchilla scaling laws as a law of nature, when they were derived for a particular architecture and dataset.



