
It could be even smaller than a Chinchilla-optimal model. The Chinchilla paper was about training the most capable models with the least training compute. If you are optimizing for capability and inference compute, you can "over-train" by providing much more data per parameter than even Chinchilla recommends, or you can train a larger model and then distill it down to a smaller size. Increasing context size increases inference compute, but the increased capabilities of a long context might let you skimp on parameters and lead to a net decrease in compute. There are probably other strategies as well, but those are the ones I know of.
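To make the trade-off concrete, here's a rough back-of-the-envelope sketch using the commonly quoted approximations of ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per generated token, plus the roughly 20-tokens-per-parameter Chinchilla ratio. The model sizes, token counts, and serving volume below are made-up illustrative numbers, not figures from the paper, and the sketch says nothing about whether the two models actually end up equally capable.

    # Back-of-the-envelope: Chinchilla-optimal model vs. a smaller "over-trained" one.
    # Approximations (assumed): train FLOPs ~ 6*N*D, inference FLOPs per token ~ 2*N.

    def train_flops(params: float, tokens: float) -> float:
        """Approximate training compute: ~6 FLOPs per parameter per training token."""
        return 6 * params * tokens

    def inference_flops(params: float, tokens_served: float) -> float:
        """Approximate inference compute: ~2 FLOPs per parameter per generated token."""
        return 2 * params * tokens_served

    # Chinchilla-optimal: roughly 20 training tokens per parameter (illustrative sizes).
    chinchilla = {"params": 70e9, "tokens": 1.4e12}

    # Over-trained alternative: far fewer parameters, many more tokens per parameter.
    overtrained = {"params": 13e9, "tokens": 5e12}

    tokens_served = 1e13  # hypothetical lifetime inference volume

    for name, cfg in [("chinchilla-optimal", chinchilla), ("over-trained", overtrained)]:
        train = train_flops(cfg["params"], cfg["tokens"])
        total = train + inference_flops(cfg["params"], tokens_served)
        print(f"{name:>18}: train {train:.2e} FLOPs, train+inference {total:.2e} FLOPs")

With these made-up numbers the over-trained 13B model costs less to train and far less to serve, so its lifetime compute comes out well below the 70B Chinchilla-optimal model's, which is the sense in which you "pay more per parameter during training to save compute at inference."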



Ah! Interesting, I thought capability was capped by parameter count, but you're saying you can keep getting more capability out of a fixed parameter count by continuing to train past what the Chinchilla paper specifies. That's really cool.



