Someone got ZLUDA running llama.cpp a while back (search the ZLUDA/llama.cpp issues). If I recall correctly, it ran at about half the speed of the existing ROCm implementation.

I tried ROCm on my iGPU last year and you do get a real benefit for prompt processing (about 5x faster), but inference is basically bottlenecked by memory bandwidth whether you're on the CPU or the GPU. Here are my results: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...
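For intuition, here's a rough back-of-the-envelope sketch of why decode speed converges on CPU and iGPU (the bandwidth, model-size, and efficiency numbers below are illustrative assumptions, not my measured results):

    # Every generated token streams (roughly) the whole quantized model
    # through memory once, so tok/s is capped at ~bandwidth / model size.
    def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float,
                           efficiency: float = 0.7) -> float:
        """Upper bound on decode speed from memory bandwidth alone."""
        return bandwidth_gb_s * efficiency / model_gb

    shared_ddr5 = 80.0   # GB/s, dual-channel DDR5 shared by CPU and iGPU (assumed)
    q4_7b = 4.1          # GB, ~7B model at a 4-bit quant (assumed)

    print(f"CPU : ~{max_tokens_per_sec(shared_ddr5, q4_7b):.0f} tok/s ceiling")
    print(f"iGPU: ~{max_tokens_per_sec(shared_ddr5, q4_7b):.0f} tok/s ceiling (same RAM)")

Both paths read from the same system RAM, so the ceiling is the same; the iGPU mostly helps with compute-bound prompt processing.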

Note: only GART memory, not GTT, is accessible in most inferencing options, so you'll basically only be able to load 7B quantized models into "VRAM" (it's BIOS-controlled, but usually maxes out at 6-8GB). I have some notes on this here: https://llm-tracker.info/howto/AMD-GPUs#amd-apu
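If you want to see what your APU actually reports, the amdgpu driver exposes the split in sysfs. A minimal sketch, assuming card0 is the APU on a stock Linux amdgpu setup (adjust the index if you also have a dGPU):

    # Prints the BIOS "VRAM" carve-out vs. GTT as reported by amdgpu (bytes).
    from pathlib import Path

    def read_gb(attr: str, card: str = "card0") -> float:
        return int(Path(f"/sys/class/drm/{card}/device/{attr}").read_text()) / 1e9

    print(f"VRAM carve-out: {read_gb('mem_info_vram_total'):.1f} GB")
    print(f"GTT (system)  : {read_gb('mem_info_gtt_total'):.1f} GB")
    # If the carve-out reads 6-8 GB, that's the practical ceiling most
    # ROCm inference paths will see, regardless of total system RAM.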

If you plan on running a 7B-13B model locally, getting something like an RTX 3060 12GB (or, if you're on Linux, the 7600 XT 16GB might be an option w/ HSA_OVERRIDE) is probably your cheapest realistic option; it will give you about 300 GB/s of memory bandwidth and enough memory to run quants of 13/14B-class models. If you're buying a card specifically for GenAI and aren't going to dedicate time to fighting driver/software issues, I'd recommend going w/ Nvidia options first (they typically give you more Tensor TFLOPS/$ as well).
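As a rough fit check for that class of card (the bits-per-weight and KV-cache shape below are ballpark assumptions for a Llama-2-13B-style model, not exact figures):

    # Approximate memory needed for a quantized 13B model plus KV cache.
    def model_gb(params_b: float, bits_per_weight: float = 4.8) -> float:
        # Q4_K_M-style quants land around ~4.8 bits/weight (assumption)
        return params_b * bits_per_weight / 8

    def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                    ctx: int, bytes_per_elem: int = 2) -> float:
        # 2x for K and V; fp16 cache assumed
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

    total = model_gb(13) + kv_cache_gb(layers=40, kv_heads=40, head_dim=128, ctx=4096)
    print(f"13B @ ~Q4 + 4k context: ~{total:.1f} GB")  # ~11 GB: tight on 12 GB, easy on 16 GB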




This is an amazing experiment summary! I wonder if your APU conclusion (CPU for bigger models, otherwise APU) still holds, given all the advancements since then.



