You don't need "GPT4" though. Mixtral 8x7B is robust and can be run in 36 GB, or 24 GB if you're willing to compromise. A 1.5-bit quantization should bring it down to 16 GB. That's still a lot compared to the iPhone 15's 6 GB, but it's close enough to imagine it happening soon. With some kind of streaming-from-flash architecture you might be in the realm already.
> With some kind of streaming-from-flash architecture you might be in the realm already.
I thought mmap'ing models so that only the currently needed pieces stay in RAM was something that was figured out ~6 months ago? Performance wasn't terribly great iirc, but with how much faster 1.58-bit is, it should still be okay-ish.
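For reference, the mmap idea in a few lines of NumPy, with a made-up file layout and toy sizes (not llama.cpp's or any real checkpoint format):

```python
import numpy as np, tempfile, os

# Toy stand-in for a weight file: one flat fp16 array on disk.
# Names and shapes are invented purely for illustration.
n_layers, rows, cols = 4, 256, 256
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
rng = np.random.default_rng(0)
rng.standard_normal(n_layers * rows * cols).astype(np.float16).tofile(path)

# mmap the whole file; nothing is read yet. The OS pages in only the pages
# we actually touch and can evict them again under memory pressure, so the
# resident set is roughly "the weights you're currently using", not the model.
W = np.memmap(path, dtype=np.float16, mode="r", shape=(n_layers, rows, cols))

a = np.ones(cols, dtype=np.float16)
b = W[2] @ a          # only layer 2's pages get faulted in for this matvec
print(b.shape)        # (256,)
```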
There is a more detailed paper from Apple on this. Basically, you can do a little better than just mmap'ing the model and keeping only the current weights in RAM.
For LLMs, you are mostly dealing with b = W @ a, where a and b are vectors and only W is a matrix. If a is sparse (i.e. mostly 0s), you don't need all the columns of W to do the matrix-vector multiplication. A cleverly arranged W can ensure that during inference only the relevant columns are loaded from flash. Furthermore, if you apply the "One Weird Trick" paper to this matrix-vector multiplication, you can shard W by rows, i.e. `b[i:i+n] = W[i:i+n, :] @ a for i in range(0, N, n)`, such that while the previous b[i:i+n] is still being computed, you already have visibility into which columns of the next matrix need to be loaded.
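To make that concrete, here's a minimal NumPy sketch of both ideas, with toy sizes I made up; it only illustrates the arithmetic, not the actual flash-loading machinery from the Apple paper:

```python
import numpy as np

N, D = 8, 16                      # b has N entries, a has D entries, W is N x D
rng = np.random.default_rng(0)
W = rng.standard_normal((N, D))
a = rng.standard_normal(D)
a[rng.random(D) < 0.7] = 0.0      # sparse activation: mostly zeros

# Dense reference: b = W @ a
b_dense = W @ a

# Sparsity trick: only the columns of W where a is nonzero contribute,
# so only those columns ever need to be resident in RAM.
nz = np.nonzero(a)[0]
b_sparse = W[:, nz] @ a[nz]
assert np.allclose(b_dense, b_sparse)

# Row sharding: compute b a chunk of n rows at a time. Early chunks of b
# (after the nonlinearity) already tell you which columns of the *next*
# layer's matrix will be needed, so that load can overlap with the
# remaining chunks' compute.
n = 4
b_sharded = np.empty(N)
for i in range(0, N, n):
    b_sharded[i:i+n] = W[i:i+n][:, nz] @ a[nz]
assert np.allclose(b_dense, b_sharded)
```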