
GPT-2 training cost tens of thousands of dollars

GPT-3 training cost millions

GPT-4 training cost over a hundred million [1]

GPT-4 inference is slower than GPT-3 or GPT-3.5

OpenAI has billions of dollars in funding

OpenAI has the backing of Microsoft and their entire Azure infra at cost

There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.

For fun, if we plot the number of parameters vs. training cost, we can see a clear trend and, I imagine, very roughly predict the number of parameters GPT-4 has (a rough sketch of that extrapolation follows the links below):

https://i.imgur.com/rejigr5.png

https://www.desmos.com/calculator/lqwsmmnngc
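For anyone who wants to play with the extrapolation, here's a very rough sketch in Python. The parameter counts are the public figures for GPT-2 and GPT-3; the dollar amounts are just order-of-magnitude guesses consistent with the costs above, not real numbers, so treat the output as illustration only:

    # Rough log-log extrapolation of training cost vs. parameter count.
    import math

    points = [
        (1.5e9, 4e4),    # GPT-2: ~1.5B params, "tens of thousands" (assumed ~$40k)
        (1.75e11, 5e6),  # GPT-3: 175B params, "millions" (assumed ~$5M)
    ]

    # Fit cost = a * params^b in log space using the two points.
    (x1, y1), (x2, y2) = points
    b = (math.log(y2) - math.log(y1)) / (math.log(x2) - math.log(x1))
    a = y1 / x1 ** b

    # Invert the fit: given the rumored >$100M cost, what parameter count
    # would this naive trend line predict for GPT-4?
    gpt4_cost = 1e8
    gpt4_params = (gpt4_cost / a) ** (1 / b)
    print(f"fit: cost ~ {a:.2e} * params^{b:.2f}")
    print(f"naive parameter estimate at $100M: {gpt4_params:.2e}")

With these made-up cost inputs the fit lands in the low trillions of parameters, but shift the assumed dollar figures by an order of magnitude and the answer moves with them.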

[1]

> At the MIT event, Altman was asked if training GPT-4 cost $100 million; he replied, “It’s more than that.”

http://web.archive.org/web/20230417152518/https://www.wired....




> There is no way GPT-4 is the same size as GPT-3. Is it 1T parameters? I don't know. No one knows. But I think it is clear GPT-4 is significantly larger than GPT-3.

That's a fallacy. GPT-3 wasn't trained compute-optimally. It had too many parameters for the amount of training data. A compute-optimal model with 175 billion parameters would require much more training compute. In fact, the Chinchilla scaling law lets you calculate this value precisely. We could also calculate how much training compute a Chinchilla-optimal 1 trillion parameter model would need. We would just need someone to do the math.
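As a very rough sketch of that math (using the common C ≈ 6·N·D FLOPs approximation and the ~20 tokens per parameter rule of thumb rather than the paper's full fitted law, plus the ~300B token figure from the GPT-3 paper):

    # Back-of-the-envelope Chinchilla math.
    # C ~ 6 * N * D FLOPs, with the compute-optimal token count D_opt ~ 20 * N.
    # Both are simplifications of the fitted scaling law, so numbers are rough.

    def train_flops(params: float, tokens: float) -> float:
        """Approximate training compute in FLOPs."""
        return 6 * params * tokens

    def chinchilla_optimal_flops(params: float) -> float:
        """Compute to train `params` compute-optimally (~20 tokens/param)."""
        return train_flops(params, 20 * params)

    # GPT-3 as actually trained: 175B parameters on ~300B tokens.
    gpt3_actual = train_flops(175e9, 300e9)

    # The same 175B model trained compute-optimally, and a 1T-parameter model.
    gpt3_optimal = chinchilla_optimal_flops(175e9)
    one_trillion = chinchilla_optimal_flops(1e12)

    print(f"GPT-3 as trained:         {gpt3_actual:.2e} FLOPs")
    print(f"175B, Chinchilla-optimal: {gpt3_optimal:.2e} FLOPs "
          f"(~{gpt3_optimal / gpt3_actual:.0f}x more)")
    print(f"1T, Chinchilla-optimal:   {one_trillion:.2e} FLOPs")

By that back-of-the-envelope math, a compute-optimal 175B model needs roughly 10x the training compute GPT-3 actually got, and a compute-optimal 1T model roughly another 30x on top of that.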


Why does it matter in this case whether GPT-3 was trained compute-optimally or not? Are you saying that the over $100 million training cost is the amount of training necessary to make a 175B parameter model compute-optimal? And if they have the same number of parameters, why is there greater latency with GPT-4?



