> Keep in mind too though that using and scaling GPUs is not free. You have to run the models somewhere.

Long or medium term, these will probably be dirt cheap to just run in the background, though. It might be within 3-5 years, since parallel compute is still scaling and isn't as bounded by Moore's law stagnation.




I get decent performance with my 4090, enough that 30B LLMs quantized with exllama are very usable. But we're severely VRAM-limited, especially on lower-end hardware, which rarely sees more than 10 GB of VRAM.
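For a rough sense of scale (a back-of-the-envelope sketch in Python, using my own numbers, not exllama's actual allocator), the weight footprint alone explains the 10 GB problem:

    # Approximate VRAM needed just for model weights at various quantization levels.
    def weight_footprint_gib(n_params_billion: float, bits_per_weight: float) -> float:
        bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
        return bytes_total / 1024**3

    for bits in (16, 8, 4):
        print(f"30B @ {bits}-bit: ~{weight_footprint_gib(30, bits):.1f} GiB")
    # 30B @ 16-bit: ~55.9 GiB
    # 30B @ 8-bit:  ~27.9 GiB
    # 30B @ 4-bit:  ~14.0 GiB

Even at 4-bit, the weights alone are roughly 14 GiB, so a 30B model is out of reach for an 8-12 GB card before a single token of KV cache is allocated.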

I don't know how much slower it could be and still be useful, though. The big thing is we need more VRAM: 30B is context-length limited with only 24 GB of VRAM, and I've only barely made it above 3.2k tokens before running out.
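A minimal sketch of why the ceiling lands around there, assuming LLaMA-30B-ish dimensions (roughly 60 layers, hidden size 6656) and an fp16 KV cache; exllama's real overhead and activation buffers will shift these numbers, but the shape is the same:

    # Hedged estimate: the KV cache grows linearly with context length.
    def kv_cache_gib(n_tokens: int, n_layers: int = 60, hidden: int = 6656,
                     bytes_per_val: int = 2) -> float:
        per_token = 2 * n_layers * hidden * bytes_per_val  # keys + values, every layer
        return n_tokens * per_token / 1024**3

    weights_gib = 14.0  # ~4-bit 30B weights from the estimate above
    for ctx in (2048, 3200, 4096):
        kv = kv_cache_gib(ctx)
        print(f"{ctx:>5} tokens: KV ~{kv:.1f} GiB, total ~{weights_gib + kv:.1f} GiB")
    #  2048 tokens: KV ~3.0 GiB, total ~17.0 GiB
    #  3200 tokens: KV ~4.8 GiB, total ~18.8 GiB
    #  4096 tokens: KV ~6.1 GiB, total ~20.1 GiB

Add activation buffers, fragmentation, and whatever else is sitting on the card, and 24 GB gets tight right around the low thousands of tokens.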

I hope you're right: that it becomes common for systems to have dedicated TPU-type hardware, similar to smartphones, and that they absolutely load them up with VRAM (which I don't think is even that expensive?).

Models will also get smaller, but I'm skeptical we'll get GPT-4 performance with any useful context length under 24 GB of VRAM any time soon.



