> Keep in mind too though that using and scaling GPUs is not free. You have to run the models somewhere.

Long or medium term, these will probably be dirt cheap to just run in the background, though. It might be within 3-5 years, since parallel compute is still scaling and isn't as bounded by Moore's law stagnation.




I get decent performance with my 4090, enough that 30B LLMs quantized with exllama are very usable. But we're severely VRAM-limited, especially on lower-end hardware, which rarely sees more than 10 GB of VRAM.
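For a rough sense of scale (a back-of-the-envelope sketch in Python, using my own numbers, not exllama's actual allocator), the weight footprint alone explains the 10 GB problem:

    # Approximate VRAM needed just for model weights at various quantization levels.
    def weight_footprint_gib(n_params_billion: float, bits_per_weight: float) -> float:
        bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
        return bytes_total / 1024**3

    for bits in (16, 8, 4):
        print(f"30B @ {bits}-bit: ~{weight_footprint_gib(30, bits):.1f} GiB")
    # 30B @ 16-bit: ~55.9 GiB
    # 30B @ 8-bit:  ~27.9 GiB
    # 30B @ 4-bit:  ~14.0 GiB

Even at 4-bit, the weights alone are roughly 14 GiB, so a 30B model is out of reach for an 8-12 GB card before a single token of KV cache is allocated.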

I don't know how much slower it could be and still be useful, though. The big thing is we need more VRAM: 30B is context-length limited with only 24 GB of VRAM, and I've only barely made it above 3.2k tokens before running out.
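A minimal sketch of why the ceiling lands around there, assuming LLaMA-30B-ish dimensions (roughly 60 layers, hidden size 6656) and an fp16 KV cache; exllama's real overhead and activation buffers will shift these numbers, but the shape is the same:

    # Hedged estimate: the KV cache grows linearly with context length.
    def kv_cache_gib(n_tokens: int, n_layers: int = 60, hidden: int = 6656,
                     bytes_per_val: int = 2) -> float:
        per_token = 2 * n_layers * hidden * bytes_per_val  # keys + values, every layer
        return n_tokens * per_token / 1024**3

    weights_gib = 14.0  # ~4-bit 30B weights from the estimate above
    for ctx in (2048, 3200, 4096):
        kv = kv_cache_gib(ctx)
        print(f"{ctx:>5} tokens: KV ~{kv:.1f} GiB, total ~{weights_gib + kv:.1f} GiB")
    #  2048 tokens: KV ~3.0 GiB, total ~17.0 GiB
    #  3200 tokens: KV ~4.8 GiB, total ~18.8 GiB
    #  4096 tokens: KV ~6.1 GiB, total ~20.1 GiB

Add activation buffers, fragmentation, and whatever else is sitting on the card, and 24 GB gets tight right around the low thousands of tokens.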

I hope you're right: that it becomes common for systems to have dedicated TPU-type hardware, similar to smartphones, and that they absolutely load them up with VRAM (which I don't think is even that expensive?).

Models will also get smaller, but I'm skeptical we'll get GPT-4 performance with any useful context length under 24 GB of VRAM any time soon.



