Offloading is another popular method for running large LMs when you don't have the GPU memory to fit the entire model. Imagine you have an A100 GPU with 80 GB memory and want to generate text with BLOOM, a 70-block transformer model with ~2.5 GB of weights per block. For each token, offloading will load the first ~1/3 of the model (~27 blocks) from RAM/SSD into GPU memory, run a forward pass through them, then free the memory and load the next ~1/3, and so on.
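A minimal sketch of that loop, using the block count and chunk size from above (the `load_blocks`/`free_blocks` helpers are hypothetical stand-ins for whatever copies weights between RAM/SSD and VRAM, not a real API):

```python
NUM_BLOCKS = 70        # BLOOM transformer blocks
BLOCKS_PER_CHUNK = 27  # roughly 1/3 of the model fits in 80 GB at ~2.5 GB/block

def forward_with_offloading(hidden_states, load_blocks, free_blocks):
    """Run one forward pass by streaming the model through GPU memory.

    `load_blocks(start, end)` and `free_blocks(blocks)` are assumed helpers
    that copy block weights RAM/SSD -> VRAM and release them again.
    """
    for start in range(0, NUM_BLOCKS, BLOCKS_PER_CHUNK):
        end = min(start + BLOCKS_PER_CHUNK, NUM_BLOCKS)
        blocks = load_blocks(start, end)          # ~27 * 2.5 GB copied to VRAM
        for block in blocks:
            hidden_states = block(hidden_states)  # forward pass on GPU
        free_blocks(blocks)                       # make room for the next chunk
    return hidden_states                          # feeds the LM head -> next token
```

Note that this whole loop runs again for every generated token, so the full ~175 GB of weights is re-copied each step.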
It turns out, Petals is faster than offloading even though it communicates over the Internet (possibly with servers far away from you). That's because Petals only sends NN activations between servers (a small amount of data), while offloading copies hundreds of GB of NN weights into GPU VRAM to generate each new token.
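A rough back-of-envelope comparison of the per-token data movement, assuming the block sizes above, BLOOM's hidden dimension of 14336, and 16-bit activations:

```python
NUM_BLOCKS = 70
BLOCK_SIZE_GB = 2.5
HIDDEN_SIZE = 14336   # BLOOM-176B hidden dimension
BYTES_PER_VALUE = 2   # fp16/bf16 activations

# Offloading: every block's weights cross the RAM/SSD -> VRAM boundary per token.
weights_moved_gb = NUM_BLOCKS * BLOCK_SIZE_GB         # ~175 GB

# Petals: only the new token's activation vector crosses the network
# at each hand-off between consecutive servers.
activation_kb = HIDDEN_SIZE * BYTES_PER_VALUE / 1024  # ~28 KB per hop

print(f"offloading: ~{weights_moved_gb:.0f} GB of weights per token")
print(f"Petals:     ~{activation_kb:.0f} KB of activations per hop per token")
```

So even a slow Internet link has far less data to move than the PCIe bus does under offloading.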
Interestingly, it sounds like offloading could be made quite efficient in a batched setting if you primarily care about throughput rather than latency. Though I guess for most current LLM applications, latency is quite important.
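A toy amortization sketch of that trade-off: the weight transfer is paid once per forward pass regardless of batch size, so large batches raise throughput even though each sequence still waits seconds for its next token. The bandwidth and compute numbers are illustrative assumptions, not measurements:

```python
WEIGHTS_GB = 70 * 2.5             # ~175 GB streamed per forward pass
HOST_TO_GPU_GBPS = 25             # assumed effective host->GPU bandwidth (GB/s)
COMPUTE_SEC_PER_SEQUENCE = 0.02   # assumed GPU compute per sequence per pass

transfer_time = WEIGHTS_GB / HOST_TO_GPU_GBPS  # ~7 s per pass, batch-independent

for batch_size in (1, 16, 256):
    pass_time = transfer_time + COMPUTE_SEC_PER_SEQUENCE * batch_size
    tokens_per_second = batch_size / pass_time
    print(f"batch={batch_size:>4}: ~{tokens_per_second:.2f} tokens/s overall, "
          f"~{pass_time:.1f} s until each sequence gets its next token")
```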