
Is the title misleading here?

The 30B model quantized requires 19.5 GB, not 6 GB; otherwise you get severe swapping to disk:

  model   original size   quantized size (4-bit)
  7B      13 GB           3.9 GB
  13B     24 GB           7.8 GB
  30B     60 GB           19.5 GB
  65B     120 GB          38.5 GB



Now it's clear that there was a bug in the measurement: the author used a machine with lots of RAM, so I guess most of us are still stuck with the quantized 13B. Still, the improvement hopefully translates, and I hope 30B will run with 3-bit quantization in a few days.


Also, current SSDs achieve 7.5 GB/s+ read speeds, as opposed to older SSDs from 2013 at around 500 MB/s, so performance will differ drastically depending on your system specs when pulling weights from disk to RAM on demand. There is also $ vmmap <pid>, which shows various statistics about process memory and swap usage that are not available in top or htop.
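For anyone wondering what "pulling weights from disk on demand" looks like, here is a minimal C sketch of the mmap approach being discussed. The weights filename is a made-up placeholder, not the project's actual path; the point is that mapping the file read-only means no data is read up front, and the kernel faults pages in from disk only as weights are first touched, so resident memory can stay well below the file size:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      // Hypothetical quantized weights file; the name is an assumption.
      int fd = open("ggml-model-q4_0.bin", O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      struct stat st;
      if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

      // Map the whole file read-only. Nothing is read yet; pages are
      // loaded lazily on first access and can be evicted under memory
      // pressure, since the file itself is the backing store.
      void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (weights == MAP_FAILED) { perror("mmap"); return 1; }

      printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);

      munmap(weights, st.st_size);
      close(fd);
      return 0;
  }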


Even with 7.5 GB/s you are at best going to achieve 2.7 seconds per token (each token requires touching essentially all ~20 GB of weights, and 20 GB / 7.5 GB/s ≈ 2.7 s), and that's the hyper-optimistic scenario where you can actually sustain that speed when reading the file, which is too slow for doing much. Maybe if one could get the kernel to swap more aggressively or something it could cut that time in half, but it would still be quite slow.


That's the size on disk, my man. When you quantize it to a smaller float size you lose precision on the weights and so the model is smaller. Then here they `mmap` the file and it only needs 6 GiB of RAM!


The size mentioned is already quantized (and to integers, not floats). mmap obviously doesn't do any quantization.
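To illustrate what "quantized to integers" means here, a rough C sketch of 4-bit block quantization, in the spirit of formats like ggml's Q4_0 but with the details simplified and the block size of 32 an assumption: each block of floats is stored as one float scale plus small signed integers, which is roughly where the ~4x size reduction in the table above comes from.

  #include <math.h>
  #include <stdint.h>
  #include <stdio.h>

  #define QK 32  // weights per block (an assumption for this sketch)

  // Quantize one block: store a per-block scale and map each weight
  // to a signed integer in [-7, 7] (a 4-bit range). Real formats pack
  // two such values per byte; int8_t is used here for clarity.
  void quantize_block(const float *x, float *scale, int8_t q[QK]) {
      float amax = 0.0f;
      for (int i = 0; i < QK; i++) {
          float a = fabsf(x[i]);
          if (a > amax) amax = a;
      }
      *scale = amax / 7.0f;
      float inv = (*scale != 0.0f) ? 1.0f / *scale : 0.0f;
      for (int i = 0; i < QK; i++) {
          int v = (int)roundf(x[i] * inv);
          if (v >  7) v =  7;   // clamp to the 4-bit signed range
          if (v < -7) v = -7;
          q[i] = (int8_t)v;
      }
  }

  // Dequantize a single weight back to float (lossy round trip).
  float dequantize_one(float scale, int8_t q) {
      return scale * (float)q;
  }

  int main(void) {
      float x[QK], scale;
      int8_t q[QK];
      for (int i = 0; i < QK; i++) x[i] = sinf((float)i);  // dummy weights
      quantize_block(x, &scale, q);
      printf("x[3] = %f, round-trip = %f\n",
             x[3], dequantize_one(scale, q[3]));
      return 0;
  }

The precision loss the parent comment describes is visible in the round trip: each weight is snapped to one of 15 levels within its block's range.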




