
Not really sure what exactly was said, but in a two-GPU setup you can technically live-load weights onto one GPU while running inference on the other.

At fp32 precision, storing a single layer takes around 40*d_model^2 bytes, assuming the context length isn't massive relative to d_model (which it isn't in GPT-4). On an 80 GB GPU, that means a model width of roughly 40k gives a ~64 GB layer, so a single layer fits on one GPU while still leaving space for activations. So theoretically any model below that width could run on a two-GPU setup. Beyond that you would also need tensor parallelism, which you can't do on two GPUs. But I think it's a safe assumption that GPT-4 has a sub-40k model width. And of course, if you quantize the model to 4-bit you could run about 2.8x that model width.
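
A quick back-of-the-envelope check of that figure (a minimal Python sketch; the 40*d_model^2 bytes-per-layer rule of thumb and the example widths are assumptions from the paragraph above, not anything OpenAI has confirmed):

    # Rough per-layer memory, using the ~40 * d_model^2 bytes rule of thumb
    # above (i.e. ~10 * d_model^2 params per layer at 4 bytes each).
    # The example widths are illustrative guesses, not real GPT-4 numbers.
    def layer_bytes(d_model: int, bytes_per_param: float = 4.0) -> float:
        return 10 * d_model**2 * bytes_per_param  # fp32 = 4, 4-bit = 0.5

    for d_model in (12_288, 40_000):
        print(f"d_model={d_model}: ~{layer_bytes(d_model) / 1e9:.0f} GB/layer at fp32")
    # d_model=12_288: ~6 GB/layer; d_model=40_000: ~64 GB/layer,
    # which still fits on an 80 GB GPU with room left for activations.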

My point is not that OpenAI is necessarily doing this, but that theoretically you can run massive models on a two-GPU setup.
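
For illustration, here is a minimal PyTorch-style sketch of that ping-pong pattern, where one GPU computes the current layer while the other receives the next layer's weights from host RAM. The run_layer function and the weight list are hypothetical stand-ins, not a real inference stack:

    # Minimal sketch (not OpenAI's setup) of the two-GPU ping-pong idea:
    # GPU i%2 computes layer i while the other GPU receives layer i+1's
    # weights from host RAM.
    import torch

    def run_layer(w: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Stand-in for one transformer block (attention + MLP in reality).
        return x @ w

    def streamed_forward(layer_weights_cpu, x):
        devs = [torch.device("cuda:0"), torch.device("cuda:1")]
        # Preload the first layer onto GPU 0.
        current = layer_weights_cpu[0].to(devs[0], non_blocking=True)
        for i in range(len(layer_weights_cpu)):
            dev, other = devs[i % 2], devs[(i + 1) % 2]
            # Start copying the next layer to the *other* GPU...
            if i + 1 < len(layer_weights_cpu):
                nxt = layer_weights_cpu[i + 1].to(other, non_blocking=True)
            # ...while this GPU computes the current layer. For the copy and
            # the compute to genuinely overlap, the host tensors need to be
            # in pinned memory (tensor.pin_memory()).
            x = run_layer(current, x.to(dev, non_blocking=True))
            if i + 1 < len(layer_weights_cpu):
                current = nxt
        return x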




Without performance penalties? If the model is larger than the VRAM, you have to constantly pull data from disk/RAM, right?




