The default loader doesn't seem to let you load quantized models, but if you use something like https://github.com/oobabooga/text-generation-webui you can load the model with `--load-in-8bit`, which halves the memory (it runs on my 24GB consumer card without an issue and would probably fit on a 16GB card). There are also 4-bit quantized models; you can probably run `anon8231489123/vicuna-13b-GPTQ-4bit-128g --model_type LLaMA --wbits 4 --groupsize 128`, although there have been reports that bitsandbytes has problems with 4-bit perf on some cards: https://github.com/TimDettmers/bitsandbytes/issues/181
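For reference, the `--load-in-8bit` path is basically the bitsandbytes int8 support in Hugging Face transformers, which you can also use directly. A minimal sketch (assumes `transformers`, `accelerate`, and `bitsandbytes` are installed; the model path is a placeholder, not an official repo id):

```python
# Sketch: load a LLaMA-style checkpoint in 8-bit via bitsandbytes instead of fp16.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/vicuna-13b"  # placeholder: point at your local 13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # int8 weights: roughly half the VRAM of fp16
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```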
NVIDIA segments its product lines by VRAM size. Consumer cards top out at 24GB; if you want more than that you have to buy a datacenter card for 10x the price.
Everybody's server costs are about to go through the roof.