>Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.

That's quite a small sample from which to argue the generic point that "for any arbitrary performance x, the data required to reach it reduces with size".

Key part being: "for any arbitrary performance".

Any paper that's trained more than one model size on the same data affirms the same thing.

Llama 13b was better than 7b, and Llama 65b was better than 33b.

If you're bothered by how general a statement I'm making, then fine: the point is that all training so far has pointed in that direction.
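
It's also what a Chinchilla-style parametric fit predicts: hold the target loss fixed, and the tokens needed to reach it fall as parameter count grows. Here's a rough Python sketch, assuming the approximate fitted constants reported in Hoffmann et al. 2022; the 2.1 target loss is an arbitrary illustrative number, not anything from the PaLM paper:

  # Chinchilla-style parametric loss: L(N, D) = E + A/N^alpha + B/D^beta
  # Constants are the approximate fit reported in Hoffmann et al. 2022.
  E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

  def tokens_needed(n_params, target_loss):
      # Solve L(N, D) = target_loss for D; the target is unreachable if it
      # sits below the loss floor E + A/N^alpha for this model size.
      gap = target_loss - E - A / n_params**alpha
      return float("inf") if gap <= 0 else (B / gap) ** (1 / beta)

  for n in (8e9, 62e9, 540e9):  # the three PaLM sizes from the quote
      print(f"{n:.0e} params -> {tokens_needed(n, 2.1):.2e} tokens")

Under that fit the required token count drops monotonically with model size, for any target loss the smaller model can reach at all.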