
You may have been running with CPU inference, or running models that don’t fit your VRAM.
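For llama.cpp-style runtimes, the usual culprit is layers not being offloaded to the GPU. A minimal sketch of forcing full offload, assuming the llama-cpp-python bindings (the runtime and the model filename here are assumptions, not something stated in this thread):

    # Sketch only: assumes llama-cpp-python built with a GPU backend
    # (e.g. Vulkan); the model path below is hypothetical.
    from llama_cpp import Llama

    llm = Llama(
        model_path="codestral-22b-q5_k_m.gguf",  # hypothetical 5-bit quant
        n_gpu_layers=-1,  # offload every layer; 0 silently falls back to CPU
        n_ctx=4096,
        verbose=True,     # load logs show how many layers landed on the GPU
    )
    out = llm("Explain binary search in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

If n_gpu_layers is left at its default, or the weights don't fit in VRAM, inference quietly runs on (or spills to) the CPU and throughput craters.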

I was running a 5-bit quantized Codestral 22B on a Radeon RX 7900 (20 GB), compiled with the Vulkan backend only.
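For what it's worth, that configuration should fit comfortably; a back-of-envelope estimate of the weight footprint (a sketch that ignores KV cache and runtime overhead):

    # ~12.8 GiB of weights for a 5-bit 22B model, well under 20 GB of VRAM
    params = 22e9           # Codestral 22B parameter count
    bits_per_weight = 5     # 5-bit quantization
    gib = params * bits_per_weight / 8 / 2**30
    print(f"~{gib:.1f} GiB")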

Eyeballing it only, but the prompt responses were maybe 2x-3x slower than OpenAI's GPT-4o (maybe 2-4 seconds for most paragraph-long responses).
