Context caching guide

_pdp_ · 2024-05-16T19:41:11 1715888471

We envisioned a system like this over a year ago, but since we lacked technical capabilities in this area, we couldn't solve it. To me, it is clear that a lot of information is being transmitted needlessly. There is no need to send an entire book every single time an interaction occurs. Instead, you can start from a checkpoint, remap from memory, and continue from there. Perhaps none of the current LLM service providers have an architecture that allows this, but it seems like something that should be possible and is likely to emerge in the near future.

gradys · 2024-05-16T19:46:50 1715888810

The size of the cached internal state of the network processing the book is much larger than the size of the book. The resource that is preserved with caching is the compute required to recreate that state.

dfgtyu65r · 2024-05-16T19:55:19 1715889319

Sure, but a direct forwards pass of the book would surely require more compute than simply loading and setting the hidden state?

The second doesn't require any matrix operations, it's just setting some values.

rfoo · 2024-05-16T20:07:17 1715890037

> it's just setting some values

But it may very well be slower than just recompute it. At least for ordinary MHA and even GQA.

So, either a model arch woodoo significantly reducing kv cache size (while keeping roughly the same compute cost), or some really careful implementation moving kv cache of upcoming requests to devices in background [0].

[0] My back of envelop calc shows that even then it still does not make sense for, say, Llama 3 70B on H100s. Time to stare at TPU spec harder trying to make sense of it I guess.

sshumaker · 2024-05-16T21:36:05 1715895365

It depends on how large the input prompt (previous context) is. Also, if you can keep cache on GPU with a LRU mechanism, for certain workloads it's very efficient.

You can also design an API optimized for batch workloads (say the same core prompt with different data for instruct-style reasoning) - that can result in large savings in those scenarios.

ethbr1 · 2024-05-16T21:57:27 1715896647

If you can pipeline upcoming requests and tie state to a specific request, doesn't that allow you to change how you design physical memory? (at least for inference)

Stupid question, but why wouldn't {extremely large slow-write, fast-read memory} + {smaller, very fast-write memory} be a feasible hardware architecture?

If you know many, many cycles ahead what you'll need to have loaded at a specific time.

Or hell, maybe it's time to go back to memory bank switching.

derefr · 2024-05-17T01:50:58 1715910658

The throughput of the PCIe link between the CPU and GPU, is far less than the aggregate throughput of the internal interconnects between neighbouring tensor cores.

Matrix operations might flow a lot of data around — but that data flow is akin to a bunch of individual people travelling along the individual residential streets they live on. There's a lot of movement there, but also a lot of capacity for movement, because there's no bottleneck of everyone needing to go to the same place or come from the same place.

Persisting the data out of the GPU and then loading it back in, is more like all those people commuting to work and then going back home. Big fan-in onto the PCIe "highway" over to the CPU and into RAM; then big fan-out back. Traffic jams for miles.

In the time it takes to restore a 1GB state snapshot from RAM into VRAM, you can probably chew through the equivalent of 1TB or more of intermediate matrix states.

gradys · 2024-05-16T20:06:40 1715890000

I don’t know of any public details on how they implement Context Caching, but that is presumably exactly what they are doing. Just caching the text would be a minimal savings.

nancarrow · 2024-05-16T20:05:14 1715889914

"some" is doing a lot of lifting. # of tokens * # of layers * head dimension * # of heads * 2 (K+V vectors) * 4-16bits (depending on quantization)

objektif · 2024-05-16T20:06:51 1715890011

But isnt the information somehow cached when you start a new chat and build context with say GPT4? If the caching was so large as you say so many chat sessions in parallel would not be possible.

dosinga · 2024-05-16T20:16:58 1715890618

That's not my understanding. We can't be sure how OpenAI does things themselves, but adding messages to a conversation in the API means just rerunning the history through the prompt every time

jsemrau · 2024-05-16T20:12:48 1715890368

>The size of the cached internal state of the network processing the book is much larger than the size of the book

It's funny that sometimes people consider LLMs as compression engines. While a lot of information gets lost in each direction (through the neural net)

shwaj · 2024-05-16T20:46:22 1715892382

Why is that funny? Sometimes compression is lossy, like JPEG and H.265

pornel · 2024-05-16T21:01:50 1715893310

And the internal state of a JPEG decoder can be an order of magnitude larger than the JPEG file (especially progressive JPEG that can't stream its output).

okdood64 · 2024-05-16T21:03:09 1715893389

I don't lose anything with gzip or rar.

fwip · 2024-05-16T21:40:16 1715895616

You can make any lossy compression scheme into a lossless scheme by appending the diff between the original and the compressed. In many cases, this still results in a size savings over the original.

You can think of this as a more detailed form of "I before E, except after C, except for species and science and..." Or, if you prefer, as continued terms of a Taylor-series expansion. The more terms you add, the more closely you approximate the original.

giancarlostoro · 2024-05-16T21:10:02 1715893802

And just as fast? The issue here is how do you do these things both accurately and while maintaining reasonable speeds.

sshumaker · 2024-05-16T21:33:48 1715895228

They are almost certainly doing this internally for their own chat products.

The simple version of this just involves saving off the KV cache in the attention layers, and restore it back instead of recomputing. It only requires small changes to inference and the attention layers.

The main challenge is being able to do this under scale, e.g. dump the weights out of GPU memory, persist them, and have a system to rapidly reload them as needed (or just regenerate).

ethbr1 · 2024-05-16T21:49:26 1715896166

2024 is the year of serverless LLM?

conradev · 2024-05-16T22:20:57 1715898057

To me, context caching is only a subset of what is possible with full control over the model. I consider this a more complete list: https://github.com/microsoft/aici?tab=readme-ov-file#flexibi...

Context caching only gets you “forking generation into multiple branches” (i.e. sharing work between multiple generations)

twobitshifter · 2024-05-16T21:51:44 1715896304

I remember hearing that the reason Claude is expensive is that every interaction makes it reread the entire conversation.

jonplackett · 2024-05-16T22:11:14 1715897474

This is the case with all LLMs as far as I know.

With the chatGPT api you just send everything up to that point + the new input to get the new output.

I think the benefit for the service is that it’s stateless. They just have requests in and out and don’t have to worry about anything else.

waldrews · 2024-05-17T00:11:36 1715904696

The pricing is still such that you can't routinely use a customized Gemini with your own long fixed pre-filled context + a variable short query. If the one-time compute cost of the caching could be effectively amortized over many queries, this pattern would replace many fine-tuning and RAG cases with something more predictable and controllable.

It's not as simple as that because the large cache needs to be loaded into GPU memory every time, but optimizations must be feasible if the usage rate is large enough to keep the cache alive in a dedicated machine.

Previous discussion: https://news.ycombinator.com/item?id=40034972#40036309

navaed01 · 2024-05-17T01:53:23 1715910803

I’m really glad they released context caching- the are legitimate use cases where a single or multiple users in an organization are all using a long prompt with examples in it. Which impacts cost and speed

_cs2017_ · 2024-05-16T23:05:15 1715900715

How is this different from KV caching?

(For reference, basic KV caching is explained here: https://www.omrimallis.com/posts/techniques-for-kv-cache-opt....)

luckyt · 2024-05-17T00:04:46 1715904286

There are several issues that make the KV cache as-is unsuitable for caching across requests. First, it requires the cached tokens to be in the exact same position in the sentence, this means it's mainly only useful for autoregressive generation where the prefix is always the same. Second, it is extremely big, so without some sort of compression, the cost to store it between requests and the time required to transfer the data to the GPU will outweigh any compute savings.

lolpanda · 2024-05-16T22:13:38 1715897618

i think llama.cpp has context caching with "--prompt-cache" but it will result in a very large cache file. i guess it's also very expensive for any inference api provider to support caching as they have to persist the file and load/unload it each time.

sshumaker · 2024-05-16T21:30:29 1715895029

This is a pretty standard technique if you're running the models yourself. e.g. ChatGPT almost certainly does this.

There's even work that is more sophisticated in this domain that allows 'template' style partial caching: https://arxiv.org/abs/2311.04934

sebzim4500 · 2024-05-16T18:21:50 1715883710

Makes sense.

It isn't in the list of suggested usecases, but I wonder if this can be used to speed up tree of thoughts or similar prompt/search techniques.

It could also speed up restricted generation, e.g. when you force the model to output valid json.

verdverm · 2024-05-16T19:35:50 1715888150

another advantage may be to let the big library providers (LangChain | LlamaIndex) to implement this before releasing to developers at large

motoboi · 2024-05-16T19:05:37 1715886337

I suppose the caching is at a layer before the LLM.

mritchie712 · 2024-05-16T18:21:06 1715883666

I really want this (in OpenAI API). Does google not care about the reputation it's getting with this shit:

> We'll be launching context caching soon, along with technical documentation and SDK support.

it's not live and god knows when it will be

nextworddev · 2024-05-16T18:29:40 1715884180

They have adopted the strategy of mainly releasing to & iterating with the large enterprise customers, because they (rightly) realized that GA'ing to every developer makes no monetary sense, unless they are trying to learn something from that launch / need to do so for competitive reasons.

refulgentis · 2024-05-16T20:06:03 1715889963

Presumably, this would be a competitive reason, no? It would further the cost savings they GA'd with Gemini Flash, and it's a differentiator from every other provider.

mritchie712 · 2024-05-16T18:36:07 1715884567

fair point

aiauthoritydev · 2024-05-16T18:40:55 1715884855

This actually is smarter strategy IMO.

Throwing the hat in the ring forces your competitors think about it and make them work too. It also gives you first mover advantage making people think you did it first.

A lot of AI tools make more sense for their enterprise customers rather than people like us. People like us are good only for hype and not making money.

flipbrad · 2024-05-16T19:10:55 1715886655

Might help show prior art to defeat software patents too.

hidelooktropic · 2024-05-16T20:39:05 1715891945

Isn't this what the Assistants API is meant for? Honestly not sure, I haven't used it before but their documentation seems to suggest you can set up the assistant to already have this context, then just send API commands to it without said context.

swyx · 2024-05-16T21:08:29 1715893709

but they charge u for the full context every time, at least for now. latency would suggest that no caching is happening