
It could be even smaller than a Chinchilla-optimal model. The Chinchilla paper was about training the most capable models with the least training compute. If you are optimizing for capability and inference compute, you can "over-train" by providing much more data per parameter than even Chinchilla recommends, or you can train a larger model and then distill it down to a smaller size. Increasing context size increases inference compute, but the increased capabilities of a long context might let you skimp on parameters and lead to a net decrease in compute. There are probably other strategies as well, but those are the ones I know of.
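To make the trade-off concrete, here's a rough back-of-the-envelope sketch using the commonly quoted approximations of ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per generated token, plus the roughly 20-tokens-per-parameter Chinchilla ratio. The model sizes, token counts, and serving volume below are made-up illustrative numbers, not figures from the paper, and the sketch says nothing about whether the two models actually end up equally capable.

    # Back-of-the-envelope: Chinchilla-optimal model vs. a smaller "over-trained" one.
    # Approximations (assumed): train FLOPs ~ 6*N*D, inference FLOPs per token ~ 2*N.

    def train_flops(params: float, tokens: float) -> float:
        """Approximate training compute: ~6 FLOPs per parameter per training token."""
        return 6 * params * tokens

    def inference_flops(params: float, tokens_served: float) -> float:
        """Approximate inference compute: ~2 FLOPs per parameter per generated token."""
        return 2 * params * tokens_served

    # Chinchilla-optimal: roughly 20 training tokens per parameter (illustrative sizes).
    chinchilla = {"params": 70e9, "tokens": 1.4e12}

    # Over-trained alternative: far fewer parameters, many more tokens per parameter.
    overtrained = {"params": 13e9, "tokens": 5e12}

    tokens_served = 1e13  # hypothetical lifetime inference volume

    for name, cfg in [("chinchilla-optimal", chinchilla), ("over-trained", overtrained)]:
        train = train_flops(cfg["params"], cfg["tokens"])
        total = train + inference_flops(cfg["params"], tokens_served)
        print(f"{name:>18}: train {train:.2e} FLOPs, train+inference {total:.2e} FLOPs")

With these made-up numbers the over-trained 13B model costs less to train and far less to serve, so its lifetime compute comes out well below the 70B Chinchilla-optimal model's, which is the sense in which you "pay more per parameter during training to save compute at inference."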



Ah! Interesting, I thought capability was capped by parameter count, but you're saying you can keep getting more capability out of a fixed parameter count by continuing to train past what the Chinchilla paper specifies. That's really cool.



