Upshot of the paper -- right now LLMs keep a KV cache for every layer in the stack. Why not just the top layer? It would save memory.
Initial result -- those KV caches in lower layers matter, and output suffered.
Updated plan -- cull half the KV layers! This works 'nearly' as well as keeping all of them, with memory and compute savings.
Downside -- roughly triple the training time, and worse out-of-band / long-context performance.
This feels to me like a technique you'd use on a particular architecture deployed at the edge, where compute matters and you have a little room to give up on performance. Phi-3 on a Raspberry Pi, basically.
Interesting! As always, I wish papers showed actual prompt outputs, not just perplexity numbers. But here we are.
I wish papers could be accepted with negative results. There's a lot of value in not repeating the same mistakes, especially in a field like deep learning, which is not an exact science and is mostly driven by intuition.
From what I understand, not quite. It looks like the cost of training might be similar, but it's less parallelisable within a single token sequence. This is because they have to compute the KV of token T before they can use it for token T+1, whereas in regular training you can compute the KV at each layer for every position in the sequence at once. You're right that it took 2.7x longer to train the smallest model, but I wouldn't be surprised if the GPU utilisation was proportionally lower too.
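For intuition, here's a toy NumPy sketch of that dependency difference -- assumed shapes, single attention head, no MLP, and the "top-layer-KV-only" function is my simplified reading of the scheme, not the paper's code. In the standard pass each layer handles every position in one matmul; in the top-layer-KV variant, position t+1 can't start until the full stack has finished position t.

```python
# Toy sketch (assumed shapes, single head, no MLP) contrasting the two
# dependency structures -- an illustration, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, L = 8, 16, 4
Wq = rng.standard_normal((L, d, d)) * 0.1
Wk = rng.standard_normal((L, d, d)) * 0.1
Wv = rng.standard_normal((L, d, d)) * 0.1
x = rng.standard_normal((seq_len, d))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def standard_pass(x):
    # Regular training: each layer computes K/V for ALL positions in one matmul,
    # so the sequence dimension is fully parallel (the causal mask handles ordering).
    h = x
    mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
    for l in range(L):
        q, k, v = h @ Wq[l], h @ Wk[l], h @ Wv[l]
        h = h + softmax((q @ k.T) / np.sqrt(d) + mask) @ v
    return h

def top_layer_kv_pass(x):
    # Top-layer-KV scheme (simplified): every layer attends to the TOP layer's
    # K/V of earlier tokens, so token t+1 can't start until the whole stack has
    # finished token t -- the loop over positions is inherently sequential.
    K_top, V_top = np.empty((0, d)), np.empty((0, d))
    out = []
    for t in range(seq_len):
        h = x[t:t + 1]
        for l in range(L):
            if len(K_top):
                q = h @ Wq[l]
                h = h + softmax((q @ K_top.T) / np.sqrt(d)) @ V_top
        # Only now is token t's reusable (top-layer) K/V available for t+1.
        K_top = np.vstack([K_top, h @ Wk[-1]])
        V_top = np.vstack([V_top, h @ Wv[-1]])
        out.append(h)
    return np.vstack(out)

print(standard_pass(x).shape, top_layer_kv_pass(x).shape)
```

The total flops end up in the same ballpark; what's lost is the ability to batch the sequence dimension during training, which is consistent with the longer wall-clock time mentioned above.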
It's not an outlier. You will see more than a 26x improvement if you try this on an even deeper LLM with more layers. The deepest model they have applied it to has 30 billion parameters.
Edit: I apologize. The table was cut off on mobile and I didn't see that they sneaked in GPU+CPU offloading for the 25x result.
Sure; but a 40% improvement is much less than a 26x improvement. If 40% is the realistic figure, cite that. Changing the title to include an outlier of 26x is clickbaity.
LLM inference optimization was key to the OpenAI GPT-4o presentation (2x faster, 50% cheaper), and it's driving lots of industry research because it translates directly into cost savings, but it's refreshing to see so many techniques published as papers (e.g. from Stanford, Berkeley…)
You have to run the entire model once for every token. To generate the first token you have to process the entire preceding context once. On the second token you don't need to recompute the context; you can just reuse what you calculated in the previous pass. For attention you take the current token and run a pairwise computation against all the cached keys, so you know which earlier tokens relate to the current one, and then use that to weight the cached values.
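As a concrete toy version of that loop -- single layer, single head, made-up weights, and a greedy argmax standing in for the real LM head and sampling:

```python
# Minimal single-layer, single-head sketch of KV-cached decoding.
# Weights and the "next token" step are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 100
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Wout = rng.standard_normal((d, vocab)) * 0.1
embed = rng.standard_normal((vocab, d))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decode(prompt_ids, n_new):
    # First pass: process the whole prompt once and fill the KV cache.
    x = embed[prompt_ids]                       # [prompt_len, d]
    K_cache, V_cache = x @ Wk, x @ Wv           # one K/V row per past token
    last = x[-1:]
    generated = []
    for _ in range(n_new):
        # Each new token: one query against ALL cached keys (the pairwise step),
        # then a weighted sum over the cached values.
        q = last @ Wq
        attn = softmax((q @ K_cache.T) / np.sqrt(d)) @ V_cache
        next_id = int(np.argmax(attn @ Wout))   # stand-in for the real LM head
        generated.append(next_id)
        # Append the new token's K/V so the next step can reuse them.
        last = embed[next_id][None, :]
        K_cache = np.vstack([K_cache, last @ Wk])
        V_cache = np.vstack([V_cache, last @ Wv])
    return generated

print(decode([1, 2, 3, 4], n_new=5))
```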
This is unavoidable, by the way. Any method that avoids quadratic attention necessarily degrades accuracy, because sub-quadratic means looking at less than all of the tokens in a given pass. You're bound to miss at least some of them. When you consider how simple the classical attention mechanism is, you realize there is not much you can do: any pre-attention pass is necessarily going to be more complicated than the quadratic attention mechanism itself.
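To put a number on "quadratic": in the cached loop above, position t computes one attention score per cached token, so generating a full sequence of n tokens does

```latex
\sum_{t=1}^{n} t = \frac{n(n+1)}{2} = \Theta(n^2)
```

score computations in total. Anything asymptotically cheaper has to skip some of those pairs.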
What they do is get rid of the KV cache entirely after w layers. This saves memory, which lets you fit more context in the same amount of VRAM. In the paper they decided to run more batches in parallel, but they could just as well have advertised a massive increase in context length. 128k context needs 16GB on my machine for llama3 7b; their approach would allow at least 784k context in the same 16GB.
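Here's a back-of-the-envelope check of that kind of estimate; the layer/head/dim numbers below are an assumed Llama-3-8B-like configuration (GQA with 8 KV heads, fp16), not figures from the paper or from the comment above.

```python
# Rough KV cache size estimate; model shape is an assumed Llama-3-8B-like
# configuration, not numbers taken from the paper.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values; fp16 -> 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

full = kv_cache_bytes(128_000)                 # cache K/V in every layer
print(f"all 32 layers: {full / 2**30:.1f} GiB")

# If only a couple of layers still keep a KV cache, the same memory budget
# stretches over proportionally more context.
reduced = kv_cache_bytes(128_000, n_layers=2)
print(f"2 layers:      {reduced / 2**30:.1f} GiB")
print(f"context multiplier at equal cache memory: {full / reduced:.0f}x")
```

With those assumed shapes, 128k of context lands right around the 16GB figure mentioned above; the exact context multiplier then just scales with how many layers still keep a cache.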
The KV cache is just another tensor used in matmuls. Unlike the model weights, which are fixed, the KV cache is constructed anew for every input. Think of it as the model growing new weights at inference time to represent what it learns about the user's input, because not everything can be baked into the pretrained model.
You want to store your KV cache in the same processor that does the rest of your matmuls.