Basic math related to computation and memory usage for transformers (eleuther.ai)
168 points by tim_sw on April 19, 2023 | 13 comments



Great article... However, the proliferation of "quantization" (8-bit, 4-bit, 3, 2, etc.) so normies like myself can run transformer-based models on consumer grade hardware has changed this math significantly. It has also changed the landscape for text generation at such a pace that it's nearly impossible to keep up.

I don't look at any model the same after running head-to-head comparisons of full precision against 4-bit quantization on my machine. There is little to no perceptible change between models with the same base weights. BUT!!! I am now able to run models that required a DGX a few weeks ago on my home computer thanks to quantization. These models are better in every way from my POV. I am now more interested in what I can "do" with the models vs. just getting them to run. 30B at 4 bits is the sweet spot for my setup.
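
Rough back-of-the-envelope to make that concrete (the 2 bytes/param for fp16 and 0.5 bytes/param for 4-bit are the usual figures, not anything measured here):

$$ \text{memory}_{\text{weights}} \approx 30 \times 10^{9}\ \text{params} \times 2\ \text{bytes/param (fp16)} \approx 60\ \text{GB} $$

$$ \text{memory}_{\text{weights}} \approx 30 \times 10^{9}\ \text{params} \times 0.5\ \text{bytes/param (4-bit)} \approx 15\ \text{GB} $$

That's roughly the difference between needing multiple datacenter GPUs and fitting on a single 24 GB consumer card, with some headroom left over for activations and the KV cache.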


TFA is about training, which happens at 16-32 bits.
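
For what it's worth, the usual mixed-precision Adam accounting per parameter (the standard figures, not quoting TFA's exact breakdown) looks like:

$$ \underbrace{2}_{\text{fp16 weights}} + \underbrace{2}_{\text{fp16 gradients}} + \underbrace{4 + 4 + 4}_{\text{fp32 master weights, momentum, variance}} \approx 16\ \text{bytes/param} $$

so training memory dwarfs a 4-bit inference setup before activations even enter the picture.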



The title: for a second, I thought people were using eddy currents in the electrical grid to perform computation. Maybe it's Turing complete.


Nice article, though I feel something went amiss with this part:

$$ \begin{align}\text{Total Memory}_{\text{Training}} = \text{memory}_{\text{model}}+\text{memory}_{\text{optimizer}}+\text{memory}_{\text{activations}}+\text{memory}_{\text{gradients}}\end{align} $$


Do you have JavaScript disabled? That is LaTeX, which should be converted to images (or SVG) dynamically after the page loads.


Nope. Stock iOS Safari.


Shouldn't the memory needed scale quadratically with sequence length rather than the linear scaling they have in their equations?


Not if they use flash attention, which solves the memory problem by working tile by tile. They never materialise the whole attention matrix at once. But the computation time is still quadratic.
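
Rough sketch of the scaling, in my own notation rather than TFA's ($b$ = batch size, $s$ = sequence length, $a$ = attention heads, $h$ = hidden dimension):

$$ \text{memory}_{\text{standard attention}} = O(b \cdot a \cdot s^{2}) \qquad \text{memory}_{\text{flash attention}} = O(b \cdot s \cdot h) $$

$$ \text{compute}_{\text{either way}} = O(b \cdot s^{2} \cdot h) $$

The attention-matrix term that would dominate memory at long sequence lengths goes away, but the FLOP count doesn't.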


They present it as an article about transformers in general, not ones using Flash Attention. Anyway, maybe they're presenting the per-token memory requirement instead of the requirement for the entire sequence at once.


They don't mention it explicitly except in one place:

> GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention.

This implies they are using flash attention.


Has anyone tried to make it cubic? LLMs have reached the current limit by increasing window size, but who knows what increasing dimensionality will bring.


Great post! Very detailed explanations. This makes it easier for other teams to get into training large models.



