>It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100.
They are training the model on 3000/22 ≈ 136 times the chinchilla-optimal amount of data. It will be interesting to see how much the model keeps improving that far beyond the chinchilla point.
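For what it's worth, here's a quick back-of-the-envelope sketch of where those numbers come from. It assumes the rough ~20 tokens-per-parameter rule of thumb people take from the Chinchilla paper, so treat it as an approximation rather than the paper's exact fit:

    # Rough chinchilla-optimal token budget, using the ~20 tokens/param heuristic
    params = 1.1e9                    # TinyLlama parameter count
    tokens_per_param = 20             # approximate Chinchilla rule of thumb
    optimal_tokens = params * tokens_per_param
    print(f"chinchilla-optimal tokens: {optimal_tokens / 1e9:.0f}B")        # ~22B

    actual_tokens = 3e12              # the planned 3T-token run
    print(f"overtraining factor: {actual_tokens / optimal_tokens:.0f}x")    # ~136x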
I now come to understand that the technobabble in Star Trek wasn't that well predicted; in the future we will not be reversing polarities by aligning field cores. Picard will have us align our llamas with chihuahuas to get an alpacafied chinchilla model.
Chinchilla predicts that you could get lower loss by training a larger model with that amount of data. But the model size in this case was chosen for other reasons, mostly speed of inference and cost of fine-tuning. So it's just irrelevant here.
Well, it's relevant if you want to compare this parameter-bound model against one trained chinchilla-optimally with the same amount of compute, to see how much you're trading away.
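To put a rough number on that trade, here is a hedged sketch using the parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta from the Chinchilla paper, with the published Approach-3 constants and the standard C ≈ 6·N·D compute approximation. The comparison itself is just a back-of-the-envelope estimate, not anything the TinyLlama authors have claimed:

    # Chinchilla parametric loss fit, L(N, D) = E + A/N^alpha + B/D^beta
    # (constants as reported for Approach 3 in Hoffmann et al. 2022)
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    # TinyLlama's parameter-bound run: 1.1B params on 3T tokens
    n_tiny, d_tiny = 1.1e9, 3e12
    compute = 6 * n_tiny * d_tiny          # C ~= 6*N*D training FLOPs

    # Compute-matched chinchilla-optimal run (~20 tokens/param heuristic):
    # 6 * N * (20 * N) = C  =>  N = sqrt(C / 120)
    n_opt = (compute / 120) ** 0.5
    d_opt = 20 * n_opt

    print(f"parameter-bound: N={n_tiny:.2e}, D={d_tiny:.2e}, loss ~ {loss(n_tiny, d_tiny):.2f}")
    print(f"compute-optimal: N={n_opt:.2e}, D={d_opt:.2e}, loss ~ {loss(n_opt, d_opt):.2f}")

With those constants the two come out around 2.16 vs 2.10, i.e. under this fit the parameter-bound run gives up a few hundredths of loss in exchange for a model that is much cheaper to serve and fine-tune.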