
You may have been running with CPU inference, or running models that don’t fit your VRAM.
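For llama.cpp-style runtimes, the usual culprit is layers not being offloaded to the GPU. A minimal sketch of forcing full offload, assuming the llama-cpp-python bindings (the runtime and the model filename here are assumptions, not something stated in this thread):

    # Sketch only: assumes llama-cpp-python built with a GPU backend
    # (e.g. Vulkan); the model path below is hypothetical.
    from llama_cpp import Llama

    llm = Llama(
        model_path="codestral-22b-q5_k_m.gguf",  # hypothetical 5-bit quant
        n_gpu_layers=-1,  # offload every layer; 0 silently falls back to CPU
        n_ctx=4096,
        verbose=True,     # load logs show how many layers landed on the GPU
    )
    out = llm("Explain binary search in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

If n_gpu_layers is left at its default, or the weights don't fit in VRAM, inference quietly runs on (or spills to) the CPU and throughput craters.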

I was running a 5-bit quantized Codestral 22B on a Radeon RX 7900 (20 GB), compiled with the Vulkan backend only.
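For what it's worth, that configuration should fit comfortably; a back-of-envelope estimate of the weight footprint (a sketch that ignores KV cache and runtime overhead):

    # ~12.8 GiB of weights for a 5-bit 22B model, well under 20 GB of VRAM
    params = 22e9           # Codestral 22B parameter count
    bits_per_weight = 5     # 5-bit quantization
    gib = params * bits_per_weight / 8 / 2**30
    print(f"~{gib:.1f} GiB")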

Eyeballing it only, but the prompt responses were maybe 2x-3x slower than OpenAI's GPT-4o (maybe 2-4 seconds for most paragraph-long responses).
