You don't need "GPT4" though. Mixtral 8x7B is robust and can be run in 36 GB, or 24 GB if you're willing to compromise. A 1.5-bit quantization should bring it down to 16 GB. That's still a lot compared to the iPhone 15's 6 GB, but it's close enough to imagine it happening soon. With some kind of streaming-from-flash architecture you might be in the realm already.
> With some kind of streaming-from-flash architecture you might be in the realm already.
I thought mmap'ing models so that only the currently needed pieces stay in RAM was something that was figured out ~6 months ago? Performance wasn't terribly great iirc, but with how much faster 1.58-bit is, it should still be okay-ish.
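For reference, the mmap idea in a few lines of NumPy, with a made-up file layout and toy sizes (not llama.cpp's or any real checkpoint format):

```python
import numpy as np, tempfile, os

# Toy stand-in for a weight file: one flat fp16 array on disk.
# Names and shapes are invented purely for illustration.
n_layers, rows, cols = 4, 256, 256
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
rng = np.random.default_rng(0)
rng.standard_normal(n_layers * rows * cols).astype(np.float16).tofile(path)

# mmap the whole file; nothing is read yet. The OS pages in only the pages
# we actually touch and can evict them again under memory pressure, so the
# resident set is roughly "the weights you're currently using", not the model.
W = np.memmap(path, dtype=np.float16, mode="r", shape=(n_layers, rows, cols))

a = np.ones(cols, dtype=np.float16)
b = W[2] @ a          # only layer 2's pages get faulted in for this matvec
print(b.shape)        # (256,)
```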
There is a more detailed paper from Apple on this. Basically, you can do a little better than just mmap'ing the model and keeping only the current weights in RAM.
For LLMs, you are mostly dealing with b = W @ a, where a and b are vectors and only W is a matrix. If a is sparse (i.e. mostly 0s), you don't need all the columns of W to do the matrix-vector multiplication. A cleverly arranged W can ensure that during inference only the relevant columns are loaded from flash. Furthermore, if you apply the "One Weird Trick" paper to this matrix-vector multiplication, you can shard W by rows, i.e. `b[i:i+n] = W[i:i+n, :] @ a for i in range(0, N, n)`, such that while the previous b[i:i+n] is still being computed, you already have visibility into which columns of the next matrix need to be loaded.
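To make that concrete, here's a minimal NumPy sketch of both ideas, with toy sizes I made up; it only illustrates the arithmetic, not the actual flash-loading machinery from the Apple paper:

```python
import numpy as np

N, D = 8, 16                      # b has N entries, a has D entries, W is N x D
rng = np.random.default_rng(0)
W = rng.standard_normal((N, D))
a = rng.standard_normal(D)
a[rng.random(D) < 0.7] = 0.0      # sparse activation: mostly zeros

# Dense reference: b = W @ a
b_dense = W @ a

# Sparsity trick: only the columns of W where a is nonzero contribute,
# so only those columns ever need to be resident in RAM.
nz = np.nonzero(a)[0]
b_sparse = W[:, nz] @ a[nz]
assert np.allclose(b_dense, b_sparse)

# Row sharding: compute b a chunk of n rows at a time. Early chunks of b
# (after the nonlinearity) already tell you which columns of the *next*
# layer's matrix will be needed, so that load can overlap with the
# remaining chunks' compute.
n = 4
b_sharded = np.empty(N)
for i in range(0, N, n):
    b_sharded[i:i+n] = W[i:i+n][:, nz] @ a[nz]
assert np.allclose(b_dense, b_sharded)
```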