Basic math related to computation and memory usage for transformers (eleuther.ai)
168 points by tim_sw on April 19, 2023 | 13 comments

Great article... However, the proliferation of "quantization" (8-bit, 4-bit, 3, 2, etc.) so normies like myself can run transformer-based models on consumer-grade hardware has changed this math significantly. It has also changed the landscape for text generation at such a pace that it's nearly impossible to keep up.

I don't look at any model the same after running head-to-head comparisons of full precision and 4-bit quantization on my machine. There is little to no perceptible change between versions of the same model. BUT!!! I am now able to run models that required a DGX a few weeks ago on my home computer thanks to quantization. These models are better in every way from my POV. I am now more interested in what I can "do" with the models vs. just getting them to run. 30B at 4 bits is the sweet spot for my setup.
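For anyone who wants the back-of-the-envelope version of why 4-bit makes a 30B model fit at home, here is a rough sketch. It counts only the weights and assumes every parameter is stored at exactly the stated bit width; real quantization formats add some overhead for scales and zero points, and inference also needs room for activations and the KV cache. The function name `weight_memory_gib` is made up for illustration.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB.

    Assumption: every parameter takes exactly `bits_per_param` bits,
    with no extra storage for quantization scales or zero points.
    """
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30


# Compare a 30B-parameter model at a few common precisions.
for bits in (32, 16, 8, 4):
    gib = weight_memory_gib(30e9, bits)
    print(f"30B params at {bits:>2}-bit: {gib:6.1f} GiB")
```

Under those assumptions the weights shrink by 4x going from 16-bit to 4-bit, which is roughly the difference between needing datacenter hardware and fitting on a single high-end consumer GPU.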

TFA is about training, which happens at 16-32 bits.

The title: for a second, I thought people were using eddy currents in the electrical grid to perform computation. Maybe it's Turing complete.

Nice article, though I feel something went amiss with this part:

$$ \begin{align}\text{Total Memory}_{\text{Training}} = \text{memory}_{\text{model}}+\text{memory}_{\text{optimizer}}+\text{memory}_{\text{activations}}+\text{memory}_{\text{gradients}}\end{align} $$

Do you have javascript disabled? That is Latex which should be converted to images (or svg) dynamically after the page loads.

Nope. Stock iOS safari.

Shouldn't the memory needed scale quadratically with sequence length rather than the linear scaling they have in their equations?

Not if they use flash attention which solves the problem in fixed memory by working tile by tile. They never materialise the whole attention matrix at once. But the computation time is still quadratic.
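To make the "tile by tile" idea concrete, here is a minimal pure-Python sketch of the running-softmax trick for a single query row. It is not the actual Flash Attention kernel (which also tiles over queries and runs in on-chip GPU memory); it only shows that the output can be accumulated one key/value block at a time without ever holding the full row of scores. The function names and the block size are made up for illustration.

```python
import math


def _scores(q, keys):
    # Scaled dot-product scores of one query against a list of keys.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def attention_row_naive(q, keys, values):
    # Materialises all scores for this query at once.
    s = _scores(q, keys)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    total = sum(w)
    dim = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / total for j in range(dim)]


def attention_row_tiled(q, keys, values, block=4):
    # Processes keys/values in blocks, keeping only a running max,
    # a running softmax denominator, and a running weighted sum.
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        s = _scores(q, k_blk)
        new_max = max(running_max, max(s))
        # Rescale what was accumulated under the old max.
        rescale = math.exp(running_max - new_max)
        w = [math.exp(x - new_max) for x in s]
        running_sum = running_sum * rescale + sum(w)
        acc = [
            a * rescale + sum(wi * v[j] for wi, v in zip(w, v_blk))
            for j, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions do the same amount of arithmetic per key, so the total work over all queries is still quadratic in sequence length; only the size of the intermediate score buffer changes.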

They present it as an article about transformers in general, not ones using Flash Attention. Anyway maybe they're presenting per token memory requirement instead of the requirement for the entire sequence at once.

They don't mention it explicitly except in one place:

> GPT-NeoX achieves 150 TFLOP/s/A100 with normal attention and 180 TFLOP/s/A100 with Flash Attention.

This advice implies they are using flash attention.

Has anyone tried to make it cubic? LLMs have reached the current limit by increasing window size, but who knows what increasing dimensionality will bring.

Great post! Very detailed explanations. This makes training large models easier to get into for other teams.

