
Nice. You need a 28GB GPU, so it's not exactly something people can run on their laptop.

Everybody's server costs are about to go through the roof.




I'm running it on my ThinkPad in CPU-only mode with 64GB of RAM. It takes two to five seconds per token, but it's perfectly usable.


Or use the CPU and be limited by RAM instead of VRAM. Luckily, even with less than 32GB RAM, you can always add a swapfile to use your disk as RAM :)
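For reference, CPU-only loading through Hugging Face transformers looks roughly like this. It's a sketch, not the exact setup from the comments above; the model ID and dtype are assumptions, and any LLaMA-family checkpoint you have locally works the same way:

```python
# Minimal sketch of CPU-only inference with Hugging Face transformers.
# "lmsys/vicuna-13b-v1.5" is an illustrative placeholder, not necessarily
# the weights being discussed in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half the memory of float32; fall back to float32
                                  # if your torch build lacks bf16 CPU kernels
    low_cpu_mem_usage=True,       # stream weights in rather than building a full fp32 copy
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With 13B parameters this needs on the order of 26GB of RAM for the weights alone, which is why the swapfile trick comes up for machines with less than 32GB.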


The default loader doesn't seem to let you load quantized models, but with something like https://github.com/oobabooga/text-generation-webui you can pass `--load-in-8bit`, which halves the memory (it then runs on my 24GB consumer card without issue, and would probably fit on a 16GB card). There are also 4-bit quantized models; you can probably run `anon8231489123/vicuna-13b-GPTQ-4bit-128g --model_type LLaMA --wbits 4 --groupsize 128`, although there have been reports that bitsandbytes has 4-bit performance problems on some cards: https://github.com/TimDettmers/bitsandbytes/issues/181
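The `--load-in-8bit` flag is the bitsandbytes int8 path; outside of text-generation-webui you can get roughly the same effect directly through transformers. A sketch, with the model ID again just a placeholder:

```python
# Rough sketch of 8-bit loading via transformers + bitsandbytes,
# the same mechanism behind text-generation-webui's --load-in-8bit flag.
# Requires the bitsandbytes and accelerate packages and a CUDA GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"   # placeholder; point this at your local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # quantize weights to int8 on load, roughly halving VRAM vs. fp16
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)

inputs = tokenizer("Hello!", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```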


4-bit quantized should run on less than 9GB of VRAM.
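Back-of-the-envelope math supports both the ~28GB figure upthread and this one. A quick sketch (weights only; the real overhead from activations and the KV cache is a rough assumption on top):

```python
# Rough weight-memory estimate for a 13B-parameter model at different precisions.
# Ignores activations, KV cache, and framework overhead, which add a few GB.
params = 13e9

for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4 (GPTQ)", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name:12s} ~{gib:5.1f} GiB of weights")

# fp16         ~24.2 GiB -> plus overhead, roughly the 28GB cited upthread
# int8         ~12.1 GiB -> fits a 16-24GB consumer card
# int4 (GPTQ)  ~ 6.1 GiB -> why under 9GB of VRAM is plausible for 4-bit
```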


Wouldn't it be time for a somewhat older GPU with a lot of memory? Or is that hard to achieve?


Nvidia segments its product lines on VRAM size. Consumer cards top out at 24GB; if you want more than that, you have to buy a datacenter card for 10x the price.


There's the AMD Instinct series; the MI50 (32GB) goes for under 1000 EUR where I live.


Has anyone tried the Biren GPUs from China? What's pricing like?



