
That is with 4-bit quantization. For practical purposes I don't see the point of running anything higher than that for inference.



That's interesting though, because it implies the machine is compute-bound. A 4-bit 120B model is ~60GB, so you should get ~13 tokens/second out of 800GB/s if it were memory-bound. 7 tokens/second implies you're only getting ~420GB/s of effective bandwidth.

And the Max has half as many cores as the Ultra, implying it would be compute-bound too.
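For anyone who wants to check the arithmetic above, here is a rough sketch. It assumes a dense model where every weight is read from memory once per generated token, and it ignores KV cache traffic and other overhead, so treat the results as an upper-bound estimate rather than a measurement. The function names are made up for illustration.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters (in billions) times bytes per weight."""
    return params_billion * bits_per_weight / 8


def memory_bound_tokens_per_s(bandwidth_gb_s: float, size_gb: float) -> float:
    """Ceiling on tokens/second if each token needs one full pass over the weights."""
    return bandwidth_gb_s / size_gb


def effective_bandwidth_gb_s(tokens_per_s: float, size_gb: float) -> float:
    """Bandwidth actually being used, given an observed tokens/second rate."""
    return tokens_per_s * size_gb


# Numbers from the comment: 120B parameters at 4 bits, 800GB/s, 7 tokens/second observed.
size = model_size_gb(120, 4)
ceiling = memory_bound_tokens_per_s(800, size)
effective = effective_bandwidth_gb_s(7, size)

print(f"weights: ~{size:.0f}GB")
print(f"memory-bound ceiling: ~{ceiling:.1f} tokens/second")
print(f"effective bandwidth at 7 tokens/second: ~{effective:.0f}GB/s")
```

If the observed rate sits well below the ceiling, as it does here, memory bandwidth alone does not explain the throughput.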



