


int_19h 6 months ago | parent | context | favorite | on: Fine tune a 70B language model at home

That is with 4-bit quantization. For practical purposes I don't see the point of running anything higher than that for inference.

AnthonyMouse 6 months ago [–]

That's interesting though, because it implies the machine is compute-bound. A 4-bit 120B model is ~60GB, so you should get ~13 tokens/second out of 800GB/s if it were memory-bound. 7 tokens/second implies you're only getting ~420GB/s.

And the Max has half as many cores as the Ultra, implying it would be compute-bound too.
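
For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes a dense model at batch size 1 where every generated token reads all the weights once, and it ignores KV-cache traffic and other overhead. The function names are made up for illustration, and GB is treated as 10^9 bytes.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters (billions) * bits / 8."""
    return params_billion * bits_per_weight / 8


def memory_bound_tokens_per_s(bandwidth_gb_s: float, size_gb: float) -> float:
    """Upper bound on tokens/s if each token requires one full pass over the weights."""
    return bandwidth_gb_s / size_gb


def effective_bandwidth_gb_s(tokens_per_s: float, size_gb: float) -> float:
    """Bandwidth actually used, implied by an observed generation speed."""
    return tokens_per_s * size_gb


size = model_size_gb(120, 4)                    # 120B parameters at 4 bits each
ceiling = memory_bound_tokens_per_s(800, size)  # 800 GB/s peak memory bandwidth
effective = effective_bandwidth_gb_s(7, size)   # 7 tokens/s as reported

print(f"weights: ~{size:.0f} GB")
print(f"memory-bound ceiling: ~{ceiling:.1f} tokens/s")
print(f"implied bandwidth at 7 tokens/s: ~{effective:.0f} GB/s")
```

Under those assumptions the observed speed uses only about half of the peak bandwidth, which is what points to compute rather than memory as the limit.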
