
Is the title misleading here?

The 30B model quantized requires 19.5 GB, not 6 GB; otherwise you get severe swapping to disk:

  model   original size   quantized size (4-bit)
  7B      13 GB           3.9 GB
  13B     24 GB           7.8 GB
  30B     60 GB           19.5 GB
  65B     120 GB          38.5 GB



Now it's clear that there was a bug in the measurement: the author used a machine with lots of RAM, so I guess most of us are still stuck with the quantized 13B. Still, the improvement hopefully translates, and I hope 30B will run with 3-bit quantization in a few days.


Also, current SSDs achieve 7.5 GB/s+ read speeds, as opposed to older SSDs from 2013 at around 500 MB/s, so performance will differ drastically depending on your system specs when pulling weights from disk to RAM on demand. There is also $ vmmap <pid>, which shows various statistics about process memory and swap usage that are not available in top or htop.
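For anyone wondering what "pulling weights from disk on demand" looks like, here is a minimal C sketch of the mmap approach being discussed. The weights filename is a made-up placeholder, not the project's actual path; the point is that mapping the file read-only means no data is read up front, and the kernel faults pages in from disk only as weights are first touched, so resident memory can stay well below the file size:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      // Hypothetical quantized weights file; the name is an assumption.
      int fd = open("ggml-model-q4_0.bin", O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      struct stat st;
      if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

      // Map the whole file read-only. Nothing is read yet; pages are
      // loaded lazily on first access and can be evicted under memory
      // pressure, since the file itself is the backing store.
      void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (weights == MAP_FAILED) { perror("mmap"); return 1; }

      printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);

      munmap(weights, st.st_size);
      close(fd);
      return 0;
  }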


Even with 7.5 GB/s you are at best going to achieve 2.7 seconds per token (each token requires touching essentially all ~20 GB of weights, and 20 GB / 7.5 GB/s ≈ 2.7 s), and that's the hyper-optimistic scenario where you can actually sustain that speed when reading the file, which is too slow for doing much. Maybe if one could get the kernel to swap more aggressively or something it could cut that time in half, but it would still be quite slow.


That's the size on disk, my man. When you quantize it to a smaller float size you lose precision on the weights and so the model is smaller. Then here they `mmap` the file and it only needs 6 GiB of RAM!


The size mentioned is already quantized (and to integers, not floats). mmap obviously doesn't do any quantization.
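To illustrate what "quantized to integers" means here, a rough C sketch of 4-bit block quantization, in the spirit of formats like ggml's Q4_0 but with the details simplified and the block size of 32 an assumption: each block of floats is stored as one float scale plus small signed integers, which is roughly where the ~4x size reduction in the table above comes from.

  #include <math.h>
  #include <stdint.h>
  #include <stdio.h>

  #define QK 32  // weights per block (an assumption for this sketch)

  // Quantize one block: store a per-block scale and map each weight
  // to a signed integer in [-7, 7] (a 4-bit range). Real formats pack
  // two such values per byte; int8_t is used here for clarity.
  void quantize_block(const float *x, float *scale, int8_t q[QK]) {
      float amax = 0.0f;
      for (int i = 0; i < QK; i++) {
          float a = fabsf(x[i]);
          if (a > amax) amax = a;
      }
      *scale = amax / 7.0f;
      float inv = (*scale != 0.0f) ? 1.0f / *scale : 0.0f;
      for (int i = 0; i < QK; i++) {
          int v = (int)roundf(x[i] * inv);
          if (v >  7) v =  7;   // clamp to the 4-bit signed range
          if (v < -7) v = -7;
          q[i] = (int8_t)v;
      }
  }

  // Dequantize a single weight back to float (lossy round trip).
  float dequantize_one(float scale, int8_t q) {
      return scale * (float)q;
  }

  int main(void) {
      float x[QK], scale;
      int8_t q[QK];
      for (int i = 0; i < QK; i++) x[i] = sinf((float)i);  // dummy weights
      quantize_block(x, &scale, q);
      printf("x[3] = %f, round-trip = %f\n",
             x[3], dequantize_one(scale, q[3]));
      return 0;
  }

The precision loss the parent comment describes is visible in the round trip: each weight is snapped to one of 15 levels within its block's range.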




