
llama.cpp focuses on optimizing inference on a CPU, while exllama is for inference on a GPU.



Thanks. I thought llama.cpp got CUDA capabilities a while ago? https://github.com/ggerganov/llama.cpp/pull/1827


Oh it seems you're right, I had missed that.

As far as I can see, llama.cpp with CUDA is still a bit slower than ExLlama, but I never had the chance to do the comparison myself, and that may change soon since these projects are evolving very quickly. I'm also not exactly sure whether the quality of the output is the same between the 2 implementations.
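If you want to time it yourself, something like this with the llama-cpp-python bindings gives a rough tokens/sec number (the model path and layer count are just placeholders, swap in whatever you have locally):

  import time
  from llama_cpp import Llama  # pip install llama-cpp-python, built w/ cuBLAS

  # placeholder path -- any GGML quant will do
  llm = Llama(model_path="models/llama-13b.q4_K_M.bin", n_gpu_layers=40)

  t0 = time.time()
  out = llm("The quick brown fox", max_tokens=128)
  n = out["usage"]["completion_tokens"]
  print(f"{n / (time.time() - t0):.1f} tokens/sec")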


Until recently, exllama was significantly faster, but they're about on par now, with llama.cpp even pulling ahead on certain hardware or with certain compile-time optimizations.

There are a couple of big differences as I see it. llama.cpp uses the `ggml` format for its models. There were a few weeks where they kept making breaking revisions to it, which was annoying, but it seems to have stabilized and now also supports more flexible quantization w/ k-quants. exllama was built exclusively for 4-bit GPTQ quants (compatible w/ GPTQ-for-LLaMA, AutoGPTQ). exllama still has an advantage w/ the best multi-GPU scaling out there, but as you say, the projects are evolving quickly, so it's hard to say. It has a smaller focus/community than llama.cpp, which also has its pros and cons.
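To make the format split concrete, here's a rough sketch of loading each kind of quant from Python; the paths/repo names are placeholders and the exact APIs may have shifted since both ecosystems move fast:

  # GGML k-quant via llama.cpp's Python bindings (pip install llama-cpp-python)
  from llama_cpp import Llama
  ggml_model = Llama(model_path="models/llama-13b.q4_K_M.bin")  # placeholder path

  # 4-bit GPTQ quant via AutoGPTQ (pip install auto-gptq)
  from auto_gptq import AutoGPTQForCausalLM
  gptq_model = AutoGPTQForCausalLM.from_quantized(
      "some-user/llama-13b-4bit-128g",  # placeholder HF repo
      device="cuda:0",
      use_safetensors=True,
  )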

It's good to have multiple viable options though, especially if you're trying to find what works best w/ your environment/hardware, and I'd recommend anyone give HEAD checkouts of both a try and see which one works best for them.


Thank you for the update! Do you happen to know if there are quality comparisons somewhere, between llama.cpp and exllama? Also, in terms of VRAM consumption, are they equivalent?


ExLlama still uses a bit less VRAM than anything else out there: https://github.com/turboderp/exllama#new-implementation - this is sometimes significant; in my personal experience it can hold full context for a quantized llama-33b model on a 24GB GPU where other inference engines OOM.
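As a rough back-of-envelope for why 24GB gets tight at full context (approximate numbers, ignoring group-size overhead, activation buffers, and the CUDA context itself):

  # rough VRAM estimate for a 4-bit llama-33b at 2048 context
  params = 32.5e9                  # llama-33b parameter count (approx)
  weights_gb = params * 0.5 / 1e9  # ~4 bits/param ~= 0.5 bytes/param -> ~16.3 GB
  layers, hidden, ctx = 60, 6656, 2048          # llama-33b architecture
  kv_gb = 2 * layers * hidden * ctx * 2 / 1e9   # K+V cache in fp16 -> ~3.3 GB
  print(weights_gb + kv_gb)        # ~19.5 GB before any other overhead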

oobabooga recently did a direct perplexity comparison against various engines/quants: https://oobabooga.github.io/blog/posts/perplexities/

On wikitext, for llama-13b, the perplexity of a q4_K_M GGML on llama.cpp was within 0.3% of the perplexity of a 4-bit 128g desc_act GPTQ on ExLlama, so basically interchangeable.
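(For context, perplexity is just exp of the average negative log-likelihood per token over the eval text, so a 0.3% gap at these values is tiny; a minimal sketch:)

  import math

  def perplexity(token_logprobs):
      # token_logprobs: natural-log probabilities the model assigned
      # to each ground-truth token in the eval text
      return math.exp(-sum(token_logprobs) / len(token_logprobs))

  # e.g. perplexity([-1.6, -2.1, -0.9, ...]) over all of wikitext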

There are some new quantization formats being proposed, like AWQ, SpQR, and SqueezeLLM, that perform slightly better, but none have been implemented in any real systems yet (the SqueezeLLM paper is the latest, and has comparisons vs AWQ and SpQR if you want to read about them: https://arxiv.org/pdf/2306.07629.pdf).



Thank you.





