Not really sure what exactly was said, but on a 2-GPU setup you can technically live-load weights onto one GPU while running inference on the other.
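To make that concrete, here's a rough PyTorch sketch of the idea (purely illustrative: toy layer sizes, two CUDA devices assumed, and pinned CPU layers standing in for wherever the weights would actually stream from). One GPU runs the current layer while the other receives the next layer's weights, alternating each step:

```python
# Minimal sketch of overlapping weight loading with compute across 2 GPUs.
# Assumptions: PyTorch, two CUDA devices, toy sizes, single forward pass.
import torch
import torch.nn as nn

d_model, n_layers = 1024, 8                      # toy sizes; the argument here is about ~40k width
devices = [torch.device("cuda:0"), torch.device("cuda:1")]

# Keep all layers on CPU in pinned memory so CPU->GPU copies can run asynchronously.
cpu_layers = [nn.Linear(d_model, d_model) for _ in range(n_layers)]
for layer in cpu_layers:
    for p in layer.parameters():
        p.data = p.data.pin_memory()

def prefetch(layer: nn.Module, device: torch.device) -> nn.Module:
    # Non-blocking copy; it overlaps with compute running on the *other* GPU.
    return layer.to(device, non_blocking=True)

x = torch.randn(1, d_model, device=devices[0])
current = prefetch(cpu_layers[0], devices[0])    # preload layer 0 onto GPU 0

for i in range(n_layers):
    # Start moving the next layer onto the idle GPU before computing on this one.
    if i + 1 < n_layers:
        nxt = prefetch(cpu_layers[i + 1], devices[(i + 1) % 2])
    x = current(x.to(devices[i % 2]))            # compute while the other GPU loads
    if i + 1 < n_layers:
        current = nxt

print(x.shape)  # torch.Size([1, 1024])
```

Stream ordering on each device guarantees the compute waits for that device's weight copy, while the copy to the other device proceeds in parallel.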
At fp32 precision, storing a single layer takes around 40*d_model^2 bytes, assuming the context length isn't massive relative to d_model (which it isn't in GPT-4). With an 80 GB GPU, that means a single layer of a model up to roughly 40k wide could fit on one GPU while still leaving room for activations. So in theory any model below that width could run on a 2-GPU setup. Beyond that, a single layer no longer fits on one GPU, so you'd also need tensor parallelism, which you couldn't do on 2 GPUs. But I think it's a safe assumption that GPT-4 has a sub-40k model width. And of course if you quantize the model to 4-bit you could even run about 2.8x this model width.
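Quick back-of-the-envelope check of those numbers, taking the ~40*d_model^2 bytes/layer figure above (i.e. roughly 10*d_model^2 weights per layer at 4 bytes each) as given:

```python
# Sanity check of the per-layer memory estimate, using the ~10*d_model^2
# weights/layer assumption implied by 40*d_model^2 bytes at fp32.
gpu_bytes = 80e9                                  # one 80 GB GPU

# fp32: largest d_model whose single layer fits on one GPU
d_fp32 = (gpu_bytes / 40) ** 0.5
print(f"fp32 max width  ~{d_fp32:,.0f}")          # ~44,700 -> "roughly 40k" after activation headroom

# 4-bit (0.5 bytes/weight): 8x less memory per weight -> sqrt(8) ~ 2.8x wider layers fit
d_4bit = (gpu_bytes / (10 * 0.5)) ** 0.5
print(f"4-bit max width ~{d_4bit:,.0f}")          # ~126,500, about 2.8x the fp32 figure
```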
My point is not that OpenAI is doing this, just that you can theoretically run massive models on a 2-GPU setup.