
Well, I'm not sure which models specifically work, but it runs on llama.cpp, which means LLaMA-derivative ones. Here's a little table of quantized CPU (GGML) versions and the RAM they require as a general rule of thumb (there's a small code sketch of that rule of thumb after the tables):

> Name | Quant method | Bits | Size | RAM required | Use case

WizardLM-7B.GGML.q4_0.bin | q4_0 | 4-bit | 4.2GB | 6GB | 4-bit.

WizardLM-7B.GGML.q4_1.bin | q4_1 | 4-bit | 4.63GB | 6GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models.

WizardLM-7B.GGML.q5_0.bin | q5_0 | 5-bit | 4.63GB | 7GB | 5-bit. Higher accuracy, higher resource usage and slower inference.

WizardLM-7B.GGML.q5_1.bin | q5_1 | 5-bit | 5.0GB | 7GB | 5-bit. Even higher accuracy, higher resource usage and slower inference.

WizardLM-7B.GGML.q8_0.bin | q8_0 | 8-bit | 8GB | 10GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use.

> Name | Quant method | Bits | Size | RAM required | Use case

wizard-vicuna-13B.ggmlv3.q4_0.bin | q4_0 | 4-bit | 8.14GB | 10.5GB | 4-bit.

wizard-vicuna-13B.ggmlv3.q4_1.bin | q4_1 | 4-bit | 8.95GB | 11.0GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models.

wizard-vicuna-13B.ggmlv3.q5_0.bin | q5_0 | 5-bit | 8.95GB | 11.0GB | 5-bit. Higher accuracy, higher resource usage and slower inference.

wizard-vicuna-13B.ggmlv3.q5_1.bin | q5_1 | 5-bit | 9.76GB | 12.25GB | 5-bit. Even higher accuracy, higher resource usage and slower inference.

wizard-vicuna-13B.ggmlv3.q8_0.bin | q8_0 | 8-bit | 16GB | 18GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use.

> Name | Quant method | Bits | Size | RAM required | Use case

VicUnlocked-30B-LoRA.ggmlv3.q4_0.bin | q4_0 | 4-bit | 20.3GB | 23GB | 4-bit.

VicUnlocked-30B-LoRA.ggmlv3.q4_1.bin | q4_1 | 4-bit | 24.4GB | 27GB | 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models.

VicUnlocked-30B-LoRA.ggmlv3.q5_0.bin | q5_0 | 5-bit | 22.4GB | 25GB | 5-bit. Higher accuracy, higher resource usage and slower inference.

VicUnlocked-30B-LoRA.ggmlv3.q5_1.bin | q5_1 | 5-bit | 24.4GB | 27GB | 5-bit. Even higher accuracy, higher resource usage and slower inference.

VicUnlocked-30B-LoRA.ggmlv3.q8_0.bin | q8_0 | 8-bit | 36.6GB | 39GB | 8-bit. Almost indistinguishable from float16. Huge resource use and slow. Not recommended for normal use.
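
If it helps, here's that rule of thumb as a tiny Python sketch: peak RAM is roughly the quantized file size plus a couple of GB of overhead (KV cache, scratch buffers). The ~2GB overhead figure is my own back-of-the-envelope estimate from the rows above, not an official llama.cpp number:

    import os

    # Assumption: peak RAM ~= quantized file size + ~2GB overhead,
    # eyeballed from the tables above (not an official figure).
    OVERHEAD_GB = 2.0

    def estimated_ram_gb(model_path: str) -> float:
        size_gb = os.path.getsize(model_path) / (1024 ** 3)
        return size_gb + OVERHEAD_GB

    # e.g. an 8.14GB 13B q4_0 file comes out around 10GB,
    # close to the 10.5GB listed above.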

Copied from some of TheBloke's model descriptions on Hugging Face. With 16GB you can run practically all the 7B and 13B versions. With shared GPU+CPU inference you can also offload some layers onto a GPU (I'm not sure if that makes the initial RAM requirement smaller), but you do need CUDA for that, of course.
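
For what it's worth, here's roughly what that shared CPU+GPU setup looks like through the llama-cpp-python bindings (the model path and layer count are just placeholder examples, and you need a GGML-era build of the package compiled with cuBLAS for the offload to actually hit the GPU):

    from llama_cpp import Llama

    # n_gpu_layers controls how many transformer layers are offloaded
    # to the GPU; the rest run on the CPU. 20 is an arbitrary example.
    llm = Llama(
        model_path="./wizard-vicuna-13B.ggmlv3.q4_0.bin",  # placeholder path
        n_gpu_layers=20,
        n_ctx=2048,
    )

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])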



