
Nice. You need a 28GB GPU, so it's not exactly something people can run on their laptop.

Everybody's server costs are about to go through the roof.




I'm running it on my ThinkPad in CPU-only mode with 64GB of RAM. It takes two to five seconds per token, but it's perfectly usable.


Or use the CPU and be limited by RAM instead of VRAM. Luckily, even with less than 32GB RAM, you can always add a swapfile to use your disk as RAM :)
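For reference, CPU-only loading through Hugging Face transformers looks roughly like this. It's a sketch, not the exact setup from the comments above; the model ID and dtype are assumptions, and any LLaMA-family checkpoint you have locally works the same way:

```python
# Minimal sketch of CPU-only inference with Hugging Face transformers.
# "lmsys/vicuna-13b-v1.5" is an illustrative placeholder, not necessarily
# the weights being discussed in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half the memory of float32; fall back to float32
                                  # if your torch build lacks bf16 CPU kernels
    low_cpu_mem_usage=True,       # stream weights in rather than building a full fp32 copy
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With 13B parameters this needs on the order of 26GB of RAM for the weights alone, which is why the swapfile trick comes up for machines with less than 32GB.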


The default loader doesn't seem to let you load quantized models, but with something like https://github.com/oobabooga/text-generation-webui you can pass `--load-in-8bit`, which halves the memory (it then runs on my 24GB consumer card without issue, and would probably fit on a 16GB card). There are also 4-bit quantized models; you can probably run `anon8231489123/vicuna-13b-GPTQ-4bit-128g --model_type LLaMA --wbits 4 --groupsize 128`, although there have been reports that bitsandbytes has 4-bit performance problems on some cards: https://github.com/TimDettmers/bitsandbytes/issues/181
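The `--load-in-8bit` flag is the bitsandbytes int8 path; outside of text-generation-webui you can get roughly the same effect directly through transformers. A sketch, with the model ID again just a placeholder:

```python
# Rough sketch of 8-bit loading via transformers + bitsandbytes,
# the same mechanism behind text-generation-webui's --load-in-8bit flag.
# Requires the bitsandbytes and accelerate packages and a CUDA GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"   # placeholder; point this at your local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize weights to int8 on load, roughly halving VRAM vs. fp16
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```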


4-bit quantized should run on less than 9GB of VRAM.
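Back-of-the-envelope math supports both the ~28GB figure upthread and this one. A quick sketch (weights only; the real overhead from activations and the KV cache is a rough assumption on top):

```python
# Rough weight-memory estimate for a 13B-parameter model at different precisions.
# Ignores activations, KV cache, and framework overhead, which add a few GB.
params = 13e9

for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4 (GPTQ)", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name:12s} ~{gib:5.1f} GiB of weights")

# fp16         ~24.2 GiB -> plus overhead, roughly the 28GB cited upthread
# int8         ~12.1 GiB -> fits a 16-24GB consumer card
# int4 (GPTQ)  ~ 6.1 GiB -> why under 9GB of VRAM is plausible for 4-bit
```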


Wouldn't it be time for a somewhat older GPU with a lot of memory? Or is that hard to achieve?


Nvidia segments its product lines on VRAM size. Consumer cards top out at 24GB; if you want more than that, you have to buy a datacenter card for 10x the price.


There's the AMD Instinct series; the MI50 (32GB) goes for under 1000 EUR where I live.


Has anyone tried the Biren GPUs from China? What's pricing like?



