It might be tough to make more efficient, but $15k seems like exactly the price of "stick 6 4090s in a decent box and throw in a couple grand for my troubles", versus any revolutionary hardware configuration. The way it advertises running fp16 Llama 70B feels a bit contrived too, given the prevalence of quantizing to 8 bit at minimum.
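For scale, here's a back-of-envelope sketch of the weight memory a 70B model needs at different precisions (assuming a round 70e9 parameters and ignoring KV cache and activations):

    # Rough weight footprint for a ~70B-parameter model.
    params = 70e9
    for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        gb = params * bytes_per_param / 1e9
        print(f"{name}: {gb:.0f} GB of weights")
    # fp16: 140 GB, int8: 70 GB, int4: 35 GB.
    # 6x 24 GB 4090s = 144 GB of VRAM, so fp16 barely fits,
    # with almost no room left over for KV cache.

Which is exactly why 8-bit or lower quantization is the norm for 70B-class models on consumer hardware.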



In my opinion the best hardware for running big models is a Mac Studio with an M2 Ultra. You get 192GB of unified RAM, which can run pretty much every available model without losing performance. And it would cost you half that price.


> without losing performance

But isn't the M2 Ultra over 20x slower than this thing? ~30 TFLOPS vs 738.


By "losing performance" I meant you don't have to quantize the model heavily, since it fits in RAM. My bad for not clarifying.
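To make the fit concrete (same back-of-envelope numbers as above, ~140 GB of fp16 weights):

    # 192 GB of unified memory vs. ~140 GB of fp16 Llama 70B weights.
    headroom_gb = 192 - 140
    print(f"{headroom_gb} GB left for KV cache, activations, and the OS")

So the unquantized model fits with room to spare.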


According to this [1] article (currently at the top of HN), memory bandwidth is typically the limiting factor, so as long as your batch size isn't huge you probably aren't losing too much performance.

[1] https://finbarr.ca/how-is-llama-cpp-possible/
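A minimal sketch of that article's argument: at batch size 1, every weight is read once per generated token, so memory bandwidth caps decode speed. The bandwidth figures below are published specs (~800 GB/s for the M2 Ultra, ~1008 GB/s for a single RTX 4090); treat the outputs as upper bounds, not benchmarks:

    # Bandwidth-bound upper limit on single-stream decode speed:
    # tokens/s <= bandwidth / bytes of weights streamed per token.
    def max_tokens_per_sec(bandwidth_gb_s: float, weights_gb: float) -> float:
        return bandwidth_gb_s / weights_gb

    weights_gb = 140  # fp16 Llama 70B
    print(f"M2 Ultra: ~{max_tokens_per_sec(800, weights_gb):.1f} tok/s")
    print(f"one 4090: ~{max_tokens_per_sec(1008, weights_gb):.1f} tok/s (if it fit)")

Splitting the model across six 4090s multiplies aggregate bandwidth but adds interconnect overhead, so the real-world gap is much narrower than the raw ~25x TFLOPS difference suggests.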



