Prompt Caching (anthropic.com)
173 points by fallinditch 84 days ago | 68 comments



FWIW, Gemini / Vertex has this as well and lets you control the TTL. Billing is based on how long you keep the context cached.

https://ai.google.dev/gemini-api/docs/caching?lang=python

Costs $1 per 1M tokens per hour.
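
For anyone curious what that looks like in practice, a rough sketch with the google-generativeai Python SDK (the model name, file name, and exact fields are from memory of those docs, so treat them as illustrative rather than authoritative):

  import datetime
  import google.generativeai as genai
  from google.generativeai import caching

  genai.configure(api_key="...")  # your API key

  # Placeholder: the big, stable prefix you want to reuse across requests.
  big_context = open("huge_reference_doc.txt").read()

  # Create an explicit cache with a TTL; storage is billed per token-hour.
  cache = caching.CachedContent.create(
      model="models/gemini-1.5-flash-001",
      contents=[big_context],
      ttl=datetime.timedelta(hours=1),
  )

  # Subsequent requests reference the cached prefix instead of re-sending it.
  model = genai.GenerativeModel.from_cached_content(cached_content=cache)
  print(model.generate_content("Summarize the cached document.").text)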


That pricing is ridiculous. A token is essentially a 32-bit integer. Four bytes. A million tokens is 4 MB. Imagine paying $1/hr for less than the storage of three floppies. That's two million times more expensive than the storage cost of standard S3 (720 hours × 256M tokens per GB × $1 ≈ $184k per GB-month, vs $0.09 for S3), or 2000 times more expensive than the storage cost of ElastiCache Serverless.

(Yes, I realize it's probably more than 4 MB, but it's still an outrageously high markup. They could do their own caching without telling you, keep the difference, and make even more money.)


You have to store the KV cache, not the tokens. For Gemma 27B (probably slightly larger than Flash), this would be:

  Size of KV cache = 2 * (num_layers) * (num_kv_heads * dim_head) * seq_length * precision

  8-bit Gemma 27B KV cache = 2 * (46) * (16 * 144) * 1e6 * 1 byte ≈ 200 GB
Note that this doesn't take further optimizations into account that Google might be using.

Formula: https://developer.nvidia.com/blog/mastering-llm-techniques-i...

Gemma 27B config: https://huggingface.co/google/gemma-2-27b/blob/main/config.j...
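
For anyone who wants to play with the numbers, the same back-of-the-envelope estimate as a few lines of Python (config values are the Gemma 2 27B ones above; Google's actual Flash internals are unknown):

  def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=1):
      # 2x for keys and values, stored per layer, per KV head, per head dim, per token
      return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

  # 8-bit KV cache for a 1M-token context with a Gemma-2-27B-like config
  size = kv_cache_bytes(num_layers=46, num_kv_heads=16, head_dim=144, seq_len=1_000_000)
  print(f"{size / 1e9:.0f} GB")  # ~212 GB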


Is there an easy-to-understand source / paper about how this caching works?



Ask ChatGPT to explain how KV caching works. What they are doing here is essentially the same thing, with a few more engineering details.


Well, that really depends on where you're caching the data.

Is it a lot for caching in L1 on a chip somewhere? No, that'd be wildly cheap.

Is it a lot for "caching" on a tape somewhere? Yes.

So where on this scale does keeping it quick to get into GPU memory lie?

> That's two million times more expensive than the storage cost of standard S3

You're not comparing to S3 at all.


"RAM near a GPU" is ~the same cost as "RAM near literally any other piece of hardware". Even if it has to traverse the network, that's a fairly low, fairly fixed (in, say, the same rack) cost. Hell, it's probably even fast enough to use an NVME disk.

Google can search the entire Internet in a fraction of a second, they can keep a million tokens within a few dozen milliseconds of a GPU for less than a dollar an hour.


Is that fast enough? And how much data is being stored? They're not storing the tokens you pass in but the activations after processing them. I'll take a wild stab that the activations for Claude 3.5 are nowhere near as small as 4 MB.


You've already frowned on my comparison to S3 but I think it's apt: it's many times more expensive than S3, but (for even a gigabyte of activations) it doesn't even need to be two orders of magnitude faster than standard S3.

If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data, it's still almost an order of magnitude more expensive than an in-memory cache adjacent to the inference boxes.

When your managed cache costs eight times as much as a general-purpose managed cache _in the cloud_, you've jumped the shark on pricing.


I do not understand your first paragraph. It's more expensive than S3, and it's an entirely different thing. It's cheaper than writing on vellum, and faster!

> If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data

Is it a gigabyte of data and is it fast enough?

You've guessed 4 MB and 1 GB. What's the actual data size here? How fast do you need to get it into GPU RAM?

The entire point here is to lower latency and costs, so it has to be close and fast.

Guessing at sizes isn't helping anything here.


What is stored is not the tokens, but all keys and values of all attention layers for each token.


+1, it wouldn't be terribly useful if it were only caching the tokenizer output.


As I pointed out, even if it's a gig of data, that's still almost an order of magnitude more than the cost of a managed in-memory cache in the cloud. That's wild.


It's probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K × 100 attention layers (this is roughly the scale of Llama 3.1 405B), that's already 1M 16-bit floats per token = 2 MB. They have likely needed to implement some kind of KV compression (like DeepSeek) to make this even feasible.


There are some errors in your calculation.

> A token is 32-bit integer.

No, inside a transformer a token is a vector; for larger models that's probably something like 6k-12k floats. At 8-bit precision, that's more like 6-12 kB per token.

So assuming 100k tokens, you end up with 554 MB for the input tokens ALONE.

Depending on the model architecture the exact memory varies, but from my observation the runtime memory increase is at least on the same order of magnitude as the initial memory used to load the model, and that's for a moderate context length (<32k). It grows linearly from there, and that's without counting the n×n KV matrices.

So you are easily looking at caching 10-100 GB of data in a very hot state, and that is going to be very expensive indeed.


It's probably much more than a single GB of data for a million tokens.


It's not. You're confused and don't understand what you're talking about. Just read all the replies you've already gotten and try to understand why you're wrong instead of doubling down on an incorrect take.


It's more likely some 400GB of data.


You need to cache all the key/values on each layer, otherwise there is no point in caching.


So in your calculation, all the money spent to create the tech is free?


The "cost" is the published price that Google charges for Gemini caching.


By this logic, Windows or MS Office should be priced only slightly higher than a USB stick.


What are you trying to get at, the meaning of "cost"?

When I used the word "cost", I was referring to the price I pay, my cost, for using caching. That number is a known, published figure.


Why does prompt caching reduce costs? I'm assuming that the primary cost driver is GPU/TPU FLOPS, as opposed to any network / storage / etc costs.

My understanding is that an LLM will take in the stream of text, tokenize it (can be faster with caching, sure, but it's a minor drop in the bucket), then run a transformer on the entire sequence. You can't just cache the output of a transformer on a prefix to reduce workload.


Autoregressive models can't just resume on their own, so they have to re-process the entire prompt for each execution.

By caching that state, they can resume from where the previous request left off, completely bypassing all of that computation.

For large contexts this could save a ton of compute!

I think this feature and structured outputs are some of the biggest inventions in LLMs this year.


Prompt caching has been a thing for LLMs since GPT-2 (e.g. transformers' `use_cache=True`); it's more of a surprise that it took this long for the main LLM providers to provide a good implementation.
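
For reference, a minimal sketch of what that looks like with the transformers library (GPT-2 just because it's small; the prompt strings are placeholders):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  shared_prefix = "A long system prompt that every request shares..."
  new_suffix = " The short part that changes per request."

  with torch.no_grad():
      # Run the shared prefix once and keep the per-layer key/value tensors.
      prefix_ids = tok(shared_prefix, return_tensors="pt").input_ids
      kv_cache = model(prefix_ids, use_cache=True).past_key_values

      # Later calls that share the prefix only pay compute for the new tokens.
      suffix_ids = tok(new_suffix, return_tensors="pt").input_ids
      out = model(suffix_ids, past_key_values=kv_cache, use_cache=True)
      next_token_id = out.logits[0, -1].argmax()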


I’m building an app with OpenAI, using structured outputs. Does OpenAI also support prompt caching?


I'm sure internally they use it for the system prompt at least, probably since launch. And maybe for common initial user queries that exactly match.


They are certainly not passing the savings on to the users.


Yet. I suspect OpenAI will release a similar offering soon. (hooray, free market competition!)


That $100 billion data center has to get paid for somehow.


Not currently.


You actually can cache the "output" of a transformer on the prefix by caching what happens in the attention layer for that text string (specifically the "K" and "V" tensors). Since the attention layer is a big part of the compute cost of the transformer, this does cut down FLOPs dramatically.


Oh interesting, didn't know. How does this work past the first transformer block in the stack?


My understanding is that the attention in all transformer layers is "causal", that is, the output of a transformer layer for token N depends only on tokens 0 through N.

This means that every attention layer can reuse previously calculated outputs for the same prompt prefix. So it only needs to calculate from scratch starting at the first token that differs from the cached prefix.
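
A tiny self-contained sketch of that property (plain PyTorch, single head, made-up sizes): with a causal mask, the attention output for a position is the same whether or not later tokens are present, which is exactly what makes the prefix reusable.

  import torch

  torch.manual_seed(0)
  d = 16
  q, k, v = (torch.randn(1, 10, d) for _ in range(3))  # 10 tokens, one head

  def causal_attention(q, k, v):
      n = q.shape[1]
      scores = q @ k.transpose(-1, -2) / d ** 0.5
      mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # hide the future
      return scores.masked_fill(mask, float("-inf")).softmax(-1) @ v

  full = causal_attention(q, k, v)                         # all 10 tokens at once
  prefix = causal_attention(q[:, :6], k[:, :6], v[:, :6])  # only the first 6
  print(torch.allclose(full[:, :6], prefix))               # True: prefix outputs unchanged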


I had the same question... my guess is you can do a layer-by-layer cache, i.e. a cache for the first layer, then another independent cache for the second layer, and so on.


The transformer only looks backwards, so if the first part of the sequence (the prompt) doesn't change, you don't need to rerun it on that part, just on the part after it that changed. For use cases where the prompt is large relative to the output (e.g. lots of examples in the prompt), this can significantly speed up the workload.


I think most architectures do a layer of normalization on the whole text embeddings before calculating attention, which makes this infeasible.

Shouldn’t be a huge deal to adjust imo

One of the bigger problems is that closed model providers don't want to expose the embedding space and let users see what they have.


I don't think the normalization makes it infeasible. They should be able to make an adjustment (the reverse of the normalization) in one operation. I think they are caching the attention calcs.

The hard thing (I think) is deciding what to keep in the cache and where to keep it, given that you are serving lots of customers and the attention calc becomes a large set of numbers pretty quickly.


It’s only “hard” because they don’t want to let customers supply the cache of course


I'm not sure that's it. Presumably they want to keep the cache in GPU memory?


That's largely it, IMO. If you could get the embedding representations, you could just recreate the model logic, and then it's no longer closed source.


No, you couldn't. If they just handed you the embedding layer, it wouldn't help that much.

Plus, I'm reasonably certain they are caching the attention scores anyway.


They cache the results of the attention calc. For certain common subsets this makes a lot of sense. I'm surprised they can make it work, though, given they are serving so many different users. Someone somewhere did some very clever engineering.


Why not? It's caching the state of the model after the cached prefix, so that inference workload doesn't need to be run again.


Setting aside efficiency or accuracy, caching enhances the value of prompt engineering and thus increases the effective value of AI services (how the value is monetized or split is TBD).

Comments suggest that caching the state of the network might also reduce processing.

I wonder if it also permits better A/B-style testing by reducing the effect of cross-domain errors. If the AI service providers made it easy to provide feedback on post-cache responses, they could incorporate a quality-enhancement loop, accelerating time to product-market fit (at the risk of increasing dependency and reducing the ability to switch).


This feature was first introduced by DeepSeek, and DeepSeek will just do it automatically for you. https://platform.deepseek.com/api-docs/news/news0802/


Yep. All the LLM providers fast-follow (a.k.a. copy) each other within months, though.


This is great news. I'm using Claude to build a new SaaS [1], and this will likely save me quite a bit on API costs.

[1] https://x.com/codewithparrot


I guess they got tired of losing customers to DeepSeek. DeepSeek introduced this feature a while ago, and their prices were already minuscule given that they only have to compute 20B active parameters.


I just tried Claude the other day. What a breath of fresh air after fighting the dogshit that is OpenAI.

Far less "in the realm of", "in today's fast-moving...", "multifaceted", "delve", or other pretentious wank.

There is still some though, so they obviously used the same kind of dataset that's overweighted with academic papers. Still, I'm hopeful I can finally get it to write stuff that doesn't sound like AI garbage.

Kind of weird there's no moderation API though. Will they just cut me off if my customers try to write about things they don't like?


> Will they just cut me off if my customers try to write about things they don't like?

The response you get back will have a refusal, which is pretty standard.


You can try AWS Bedrock or Openrouter if that happens. They both have the Claude API.


Prompt + LoRA? Train an adapter?


Will this be making its way to Bedrock?


How is the Anthropic stuff on Bedrock in general? We're using OpenAI stuff on Azure right now and it's frustrating how slowly stuff gets rolled out in our region(s).



Sounds kinda useless, TBH. This sounds as if it assumes the exact same context window across requests. If so, given the 5-minute window, unless, for example, your entire team is operating in the same codebase at the same time, you won't really see any savings beyond simple prompts.

Are contexts included in the prompt cache? Are they identified as the same or not? What happens if we approach the 10k token range? 128k? 1M?


Here is a simple use-case that comes to mind. Let's say you have a medium-size repository, such that all the source files can fit in the context. You want to use Claude as an advanced auto-complete in a certain file, but it's important that it has visibility to other files in the repo.

You can put all the other source files in an initial system message and the current file in the user message. Then, if you call multiple autocompletes within 5 minutes of each other, you pay a drastically reduced price for including all of the other files in the context. Also, the latency is much reduced!

Yes, you could probably get a similar outcome by incorporating RAG, search tools, etc. into the autocomplete, but as a simple approach with fewer moving parts, caching will reduce costs for this setup.
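
Roughly, with the Anthropic Python SDK, that setup looks something like the sketch below (the variable contents are placeholders, and the beta header is the one from the launch docs, so double-check against the current documentation):

  import anthropic

  client = anthropic.Anthropic()
  other_source_files = "...concatenated contents of the rest of the repo..."
  current_file = "...the file being edited, up to the cursor..."

  response = client.messages.create(
      model="claude-3-5-sonnet-20240620",
      max_tokens=256,
      extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
      system=[{
          "type": "text",
          "text": other_source_files,
          # Marks this block as a cache breakpoint (~5 min TTL, refreshed on each hit).
          "cache_control": {"type": "ephemeral"},
      }],
      messages=[{"role": "user", "content": "Complete this file:\n" + current_file}],
  )
  print(response.content[0].text)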


I have a system prompt of around 100k tokens, which includes a couple of internal schema definitions. Having this cached could be super useful to us.


100k … I hear about and use all of these high-context LLMs, but they often fail to take into account all the information that should be in their context.

Does your approach work for you?


Would love it if you could de-identify and share a bit more detail on a prompt that big.


It’s useful to consider the impact this has on conversations.

In an ongoing conversation with a model you end up re-submitting the full text of the previous conversation, both prompts and responses, at every step. This means the cost per prompt in that conversation increases each time.

Claude prompt caching can start saving you money even with just a single user having a conversation, provided each of their replies is within five minutes of the previous reply.
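
A rough back-of-the-envelope for that case, using the published Claude 3.5 Sonnet input rates ($3/MTok base, 1.25x for cache writes, 0.1x for cache reads); the conversation shape here is made up:

  BASE = 3.00 / 1_000_000             # $ per uncached input token
  WRITE, READ = 1.25 * BASE, 0.10 * BASE

  turns, tokens_per_turn = 20, 2_000  # hypothetical chat: 20 turns, ~2k new tokens each

  # Without caching, every turn re-sends the whole history at the base rate.
  uncached = sum(BASE * tokens_per_turn * t for t in range(1, turns + 1))

  # With caching, each turn writes only its new tokens and reads the rest from cache.
  cached = sum(WRITE * tokens_per_turn + READ * tokens_per_turn * (t - 1)
               for t in range(1, turns + 1))

  print(f"input cost without cache: ${uncached:.2f}, with cache: ${cached:.2f}")
  # roughly $1.26 vs $0.26 for this made-up conversation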


It says the timeout is refreshed if there's a cache hit within the 5 minutes, and reading from the cache costs 10% of what those input tokens would cost otherwise. Seems pretty damn useful to me. What seems useless to you exactly?

I'm primarily limited by how much context I need for my queries, and most of the time the context is largely the same across multiple queries over periods of 1-60 minutes. This is the case whether it's a codebase I'm working with or a PDF (or other text documentation).

Simple queries are where I expect there to be the least gain for this kind of thing.


Think beyond AI coding assistants: JSON schema definitions, FAQs, product manuals, game instructions, game state, querying a student's thesis... anything where users ask a chatbot about something specific.


> it assumes the exact same context window across requests

That is not true: caching works across multiple requests, which is why it's so good. You can make 5 different concurrent requests and they'll all get cached, and served from the cache, as long as it's still warm for them.


The documentation is here and has better examples: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...

tl;dr: you can cache system prompts, tools, and user messages (up to 4 cache breakpoints total), with better returns on massive inputs such as documents.

The use case is more for client-facing applications that would hit the cache frequently rather than internal copilots.



