> Fine-tuning and inference up to 10x faster than offloading

What is "offloading" in this context?




Offloading is another popular method for running large language models when you don't have the GPU memory to fit the entire model. Imagine you have an A100 GPU with 80 GB memory and want to generate text with BLOOM, a 70-block transformer model with ~2.5 GB of weights per block. For each token, offloading will load the first 1/3 of the model (~27 blocks) from RAM/SSD to your GPU memory, run a forward pass through them, then free the memory and load the next 1/3, and so on.
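
In code, the loop looks roughly like this (a minimal sketch, not Petals' actual implementation; `blocks` and `hidden` are hypothetical stand-ins for the CPU-resident transformer blocks and the current hidden state):

  import torch

  def offloaded_forward(blocks, hidden):
      for block in blocks:
          block.to("cuda")            # copy this block's ~2.5 GB of weights into VRAM
                                      # (real setups batch several blocks per transfer)
          with torch.no_grad():
              hidden = block(hidden)  # the actual compute is cheap by comparison
          block.to("cpu")             # evict the weights to make room for the next block
      return hidden                   # this whole loop repeats for every generated token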

It turns out, Petals is faster than offloading even though it communicates over the Internet (possibly with servers far away from you). That's because Petals only sends NN activations between servers (a small amount of data), while offloading copies hundreds of GB of NN weights to GPU VRAM to generate each new token.
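
Rough numbers (my own back-of-envelope, assuming BLOOM-176B's hidden dimension of 14336, fp16 activations, and 8-bit weights):

  hidden_size = 14336                     # BLOOM-176B's hidden dimension
  activation_bytes = hidden_size * 2      # one token's fp16 activation vector, ~28 KB
  weight_bytes = 176e9                    # ~176 GB of 8-bit weights

  print(activation_bytes)                 # ~28 KB sent over the network per pipeline hop
  print(weight_bytes / activation_bytes)  # offloading moves ~6 million times more data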


Interestingly, it sounds like offloading could be made quite efficient in a batched setting if you primarily care about throughput rather than latency. Though I guess latency is quite important for most current LLM applications.
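
Concretely (a rough sketch with assumed numbers): the weight transfer is a fixed cost per forward pass, so a large batch amortizes it. Taking the ~5.5 s per pass that Petals' paper reports for RAM offloading (quoted in a sibling comment):

  transfer_s = 5.5                 # assumed fixed weight-transfer cost per forward pass
  for batch_size in (1, 64, 1024):
      tokens_per_s = batch_size / transfer_s
      print(batch_size, round(tokens_per_s, 1))  # 0.2 -> 11.6 -> 186.2 tokens/s,
                                                 # until compute or RAM becomes the limit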


It's mentioned in their paper: https://arxiv.org/pdf/2209.01188.pdf

  Several recent works aim to democratize LLMs by “offloading” model
  parameters to slower but cheaper memory (RAM or SSD), then running
  them on the accelerator layer by layer (Pudipeddi et al., 2020;
  Ren et al., 2021). This method allows running LLMs with a single
  low-end accelerator by loading parameters from RAM just-in-time for
  each forward pass. Offloading can be efficient for processing many
  tokens in parallel, but it has inherently high latency: for example,
  generating one token with BLOOM-176B takes at least 5.5 seconds for
  the fastest RAM offloading setup and 22 seconds for the fastest SSD
  offloading. In addition, many computers do not have enough RAM to
  offload 175B parameters.
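
For what it's worth, that 5.5 s figure is about what bus bandwidth alone predicts. A back-of-envelope check (my arithmetic, not the paper's):

  weight_gb = 176               # BLOOM-176B in 8-bit, roughly 1 byte per parameter
  pcie_gb_s = 32                # approximate peak bandwidth of PCIe 4.0 x16
  print(weight_gb / pcie_gb_s)  # 5.5 -- seconds per token, matching the paper's number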


Could a mobile or edge device be a participant, i.e. a source of resources?



