Oh, that's really interesting, and makes sense intuitively. From the abstract:
> We find that current large language models are significantly under-trained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant ... the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
Assuming the GPT-3 authors knew this, one could surmise they also 10x'ed the number of training tokens.
Edit: Should have kept reading. Sounds like GPT-3 was found to be undertrained.
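For a back-of-the-envelope sense of the gap, here's a quick sketch. The ~20 tokens-per-parameter ratio is the compute-optimal rule of thumb implied by the Chinchilla paper (70B params trained on 1.4T tokens), and GPT-3's 175B parameters / ~300B training tokens are from the GPT-3 paper; everything else is just illustrative:

```python
# Rough illustration of the "scale tokens with parameters" rule from the
# quoted abstract. The 20 tokens/parameter constant is an approximation of
# the Chinchilla compute-optimal ratio, not an exact figure.

def compute_optimal_tokens(params, tokens_per_param=20):
    """Token budget if training tokens scale linearly with model size."""
    return params * tokens_per_param

gpt3_params = 175e9   # GPT-3: 175B parameters
gpt3_tokens = 300e9   # GPT-3 was trained on ~300B tokens

print(f"Chinchilla-style token budget for 175B params: {compute_optimal_tokens(gpt3_params):.2e}")
print(f"Tokens GPT-3 was actually trained on:          {gpt3_tokens:.2e}")
# ~3.5e12 vs ~3.0e11 -- roughly an order of magnitude short, i.e. under-trained.
```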