Someone got ZLUDA running llama.cpp a while back (search the ZLUDA/llama.cpp issues). If I recall correctly, it ran at about half the speed of the existing ROCm implementation.

I tried ROCm on my iGPU last year and you do get a real benefit for prompt processing (about 5x faster), but inference is basically bottlenecked by memory bandwidth whether you're on the CPU or the GPU. Here are my results: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...
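For intuition, here's a rough back-of-the-envelope sketch of why decode speed converges on CPU and iGPU (the bandwidth, model-size, and efficiency numbers below are illustrative assumptions, not my measured results):

    # Every generated token streams (roughly) the whole quantized model
    # through memory once, so tok/s is capped at ~bandwidth / model size.
    def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float,
                           efficiency: float = 0.7) -> float:
        """Upper bound on decode speed from memory bandwidth alone."""
        return bandwidth_gb_s * efficiency / model_gb

    shared_ddr5 = 80.0   # GB/s, dual-channel DDR5 shared by CPU and iGPU (assumed)
    q4_7b = 4.1          # GB, ~7B model at a 4-bit quant (assumed)

    print(f"CPU : ~{max_tokens_per_sec(shared_ddr5, q4_7b):.0f} tok/s ceiling")
    print(f"iGPU: ~{max_tokens_per_sec(shared_ddr5, q4_7b):.0f} tok/s ceiling (same RAM)")

Both paths read from the same system RAM, so the ceiling is the same; the iGPU mostly helps with compute-bound prompt processing.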

Note: only GART memory, not GTT, is accessible in most inferencing options, so you'll basically only be able to load 7B quantized models into "VRAM" (it's BIOS-controlled, but usually maxes out at 6-8GB). I have some notes on this here: https://llm-tracker.info/howto/AMD-GPUs#amd-apu
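If you want to see what your APU actually reports, the amdgpu driver exposes the split in sysfs. A minimal sketch, assuming card0 is the APU on a stock Linux amdgpu setup (adjust the index if you also have a dGPU):

    # Prints the BIOS "VRAM" carve-out vs. GTT as reported by amdgpu (bytes).
    from pathlib import Path

    def read_gb(attr: str, card: str = "card0") -> float:
        return int(Path(f"/sys/class/drm/{card}/device/{attr}").read_text()) / 1e9

    print(f"VRAM carve-out: {read_gb('mem_info_vram_total'):.1f} GB")
    print(f"GTT (system)  : {read_gb('mem_info_gtt_total'):.1f} GB")
    # If the carve-out reads 6-8 GB, that's the practical ceiling most
    # ROCm inference paths will see, regardless of total system RAM.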

If you plan on running a 7B-13B model locally, getting something like an RTX 3060 12GB (or, if you're on Linux, the 7600 XT 16GB might be an option w/ HSA_OVERRIDE) is probably your cheapest realistic option; it will give you about 300 GB/s of memory bandwidth and enough memory to run quants of 13/14B-class models. If you're buying a card specifically for GenAI and aren't going to dedicate time to fighting driver/software issues, I'd recommend going w/ Nvidia options first (they typically give you more Tensor TFLOPS/$ as well).
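As a rough fit check for that class of card (the bits-per-weight and KV-cache shape below are ballpark assumptions for a Llama-2-13B-style model, not exact figures):

    # Approximate memory needed for a quantized 13B model plus KV cache.
    def model_gb(params_b: float, bits_per_weight: float = 4.8) -> float:
        # Q4_K_M-style quants land around ~4.8 bits/weight (assumption)
        return params_b * bits_per_weight / 8

    def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                    ctx: int, bytes_per_elem: int = 2) -> float:
        # 2x for K and V; fp16 cache assumed
        return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

    total = model_gb(13) + kv_cache_gb(layers=40, kv_heads=40, head_dim=128, ctx=4096)
    print(f"13B @ ~Q4 + 4k context: ~{total:.1f} GB")  # ~11 GB: tight on 12 GB, easy on 16 GB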




This is an amazing experiment summary! I wonder if your APU conclusion (CPU for bigger models, otherwise APU) still holds, given all the advancements since then.



