>Google trained 3 differently sized models of the same architecture (8b, 62b, 540b) on the same dataset of 780b tokens and evaluated all 3 on various tasks.
That's quite a small sample from which to argue the generic point that "for any arbitrary performance x, the data required to reach it reduces with size".
The key part being: "for any arbitrary performance".