I was running a 5-bit quantized Codestral 22B on a Radeon RX 7900 (20 GB), compiled with Vulkan only.
Eyeballing it, prompt responses were maybe 2-3x slower than OpenAI's GPT-4o (roughly 2-4 seconds for most paragraph-long responses).