Prompt Caching (anthropic.com)
173 points by fallinditch 84 days ago | 68 comments



FWIW, Gemini / Vertex has this as well and lets you control the TTL. Billing is based on how long you keep the context cached.

https://ai.google.dev/gemini-api/docs/caching?lang=python

Costs $1 per 1M tokens per hour.
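
For anyone curious what that looks like in practice, a rough sketch with the google-generativeai Python SDK (the model name, file name, and exact fields are from memory of those docs, so treat them as illustrative rather than authoritative):

  import datetime
  import google.generativeai as genai
  from google.generativeai import caching

  genai.configure(api_key="...")  # your API key

  # Placeholder: the big, stable prefix you want to reuse across requests.
  big_context = open("huge_reference_doc.txt").read()

  # Create an explicit cache with a TTL; storage is billed per token-hour.
  cache = caching.CachedContent.create(
      model="models/gemini-1.5-flash-001",
      contents=[big_context],
      ttl=datetime.timedelta(hours=1),
  )

  # Subsequent requests reference the cached prefix instead of re-sending it.
  model = genai.GenerativeModel.from_cached_content(cached_content=cache)
  print(model.generate_content("Summarize the cached document.").text)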


That pricing is ridiculous. A token is essentially a 32-bit integer. Four bytes. A million tokens is 4 MB. Imagine paying $1/hr for less than the storage of three floppies. That's two million times more expensive than the storage cost of standard S3 (720 hours × 256M tokens per GB × $1 ≈ $184k per GB-month, vs $0.09 for S3), or 2000 times more expensive than the storage cost of ElastiCache Serverless.

(Yes, I realize it's probably more than 4 MB, but it's still an outrageously high markup. They could do their own caching without telling you, keep the difference, and make even more money.)


You have to store the KV cache, not the tokens. For Gemma 27B (probably slightly larger than Flash), this would be:

  Size of KV cache = 2 * (num_layers) * (num_kv_heads * dim_head) * seq_length * precision

  8-bit Gemma 27B KV cache = 2 * (46) * (16 * 144) * 1e6 * 1 byte ≈ 200 GB
Note that this doesn't take further optimizations into account that Google might be using.

Formula: https://developer.nvidia.com/blog/mastering-llm-techniques-i...

Gemma 27B config: https://huggingface.co/google/gemma-2-27b/blob/main/config.j...
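
For anyone who wants to play with the numbers, the same back-of-the-envelope estimate as a few lines of Python (config values are the Gemma 2 27B ones above; Google's actual Flash internals are unknown):

  def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=1):
      # 2x for keys and values, stored per layer, per KV head, per head dim, per token
      return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

  # 8-bit KV cache for a 1M-token context with a Gemma-2-27B-like config
  size = kv_cache_bytes(num_layers=46, num_kv_heads=16, head_dim=144, seq_len=1_000_000)
  print(f"{size / 1e9:.0f} GB")  # ~212 GB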


Is there an easy-to-understand source / paper about how this caching works?



Ask ChatGPT to explain how KV caching works. What they are doing here is essentially the same thing, with a few more engineering details.


Well, that really depends on where you're caching the data.

Is it a lot for caching in L1 on a chip somewhere? No, that'd be wildly cheap.

Is it a lot for "caching" on a tape somewhere? Yes.

So where on this scale does keeping it quick to get into GPU memory lie?

> That's two million times more expensive than the storage cost of standard S3

You're not comparing to S3 at all.


"RAM near a GPU" is ~the same cost as "RAM near literally any other piece of hardware". Even if it has to traverse the network, that's a fairly low, fairly fixed (in, say, the same rack) cost. Hell, it's probably even fast enough to use an NVME disk.

Google can search the entire Internet in a fraction of a second, they can keep a million tokens within a few dozen milliseconds of a GPU for less than a dollar an hour.


Is that fast enough? And how much data is being stored? They're not storing the tokens you pass in but the activations after processing them. I'll take a wild stab that the activations for Claude 3.5 are nowhere near as small as 4 MB.


You've already frowned on my comparison to S3 but I think it's apt: it's many times more expensive than S3, but (for even a gigabyte of activations) it doesn't even need to be two orders of magnitude faster than standard S3.

If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data, it's still almost an order of magnitude more expensive than an in-memory cache adjacent to the inference boxes.

When your managed cache costs eight times as much as a general-purpose managed cache _in the cloud_, you've jumped the shark on pricing.


I do not understand your first paragraph. It's more expensive than S3, and it's an entirely different thing. It's cheaper than writing on vellum, and faster!

> If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data

Is it a gigabyte of data and is it fast enough?

You've guessed 4 MB and 1 GB. What's the actual data size here? How fast do you need to get it into GPU RAM?

The entire point here is to lower latency and costs, so it has to be close and fast.

Guessing at sizes isn't helping anything here.


What is stored is not the tokens, but all keys and values of all attention layers for each token.


+1, it wouldn't be terribly useful if it were only caching the tokenizer output.


As I pointed out, even if it's a gig of data, that's still almost an order of magnitude more than the cost of a managed in-memory cache in the cloud. That's wild.


It's probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K × 100 attention layers (this is roughly the scale of Llama 3.1 405B), that's already 1M 16-bit floats per token = 2 MB. They have likely needed to implement some kind of KV compression (like DeepSeek) to make this even feasible.


There are some errors in your calculation.

> A token is 32-bit integer.

No, inside a transformer a token is a vector; for larger models that's probably something like 6k-12k floats. At 8-bit precision, that's more like 6-12 kB per token.

So assuming 100k tokens, you end up with 554 MB for the input tokens ALONE.

Depending on the model architecture the exact memory varies, but from my observation the runtime memory increase is at least on the same order of magnitude as the initial memory used to load the model, and that's for a moderate context length (<32k). It grows linearly from there, and that's without counting the n×n KV matrices.

So you are easily looking at caching 10-100 GB of data in a very hot state, and that is going to be very expensive indeed.


It's probably much more than a single GB of data for a million tokens.


It's not. You're confused and don't understand what you're talking about. Just read all the replies you've already gotten and try to understand why you're wrong instead of doubling down on an incorrect take.


It's more likely some 400GB of data.


You need to cache all the key/values on each layer, otherwise there is no point in caching.


So in your calculation, all the money spent to create the tech is free?


The "cost" is the published price that Google charges for Gemini caching.


By this logic, Windows or MS Office should be priced only slightly higher than a USB stick.


What are you trying to get at, the meaning of "cost"?

When I used the word "cost", I was referring to the price I pay, my cost, for using caching. That number is a known, published figure.


Why does prompt caching reduce costs? I'm assuming that the primary cost driver is GPU/TPU FLOPS, as opposed to any network / storage / etc costs.

My understanding is that an LLM will take in the stream of text, tokenize it (can be faster with caching, sure, but it's a minor drop in the bucket), then run a transformer on the entire sequence. You can't just cache the output of a transformer on a prefix to reduce workload.


Autoregressive models can't just resume on their own, so they have to re-process the entire prompt for each execution.

By caching that state, they can resume from where the previous request left off, completely bypassing all of that computation.

For large contexts this could save a ton of compute!

I think this feature and structured outputs are some of the biggest inventions in LLMs this year.


Prompt caching has been a thing for LLMs since GPT-2 (e.g. transformers' `use_cache=True`); it's more of a surprise that it took this long for the main LLM providers to provide a good implementation.
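
For reference, a minimal sketch of what that looks like with the transformers library (GPT-2 just because it's small; the prompt strings are placeholders):

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

  shared_prefix = "A long system prompt that every request shares..."
  new_suffix = " The short part that changes per request."

  with torch.no_grad():
      # Run the shared prefix once and keep the per-layer key/value tensors.
      prefix_ids = tok(shared_prefix, return_tensors="pt").input_ids
      kv_cache = model(prefix_ids, use_cache=True).past_key_values

      # Later calls that share the prefix only pay compute for the new tokens.
      suffix_ids = tok(new_suffix, return_tensors="pt").input_ids
      out = model(suffix_ids, past_key_values=kv_cache, use_cache=True)
      next_token_id = out.logits[0, -1].argmax()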


I’m building an app with OpenAI, using structured outputs. Does OpenAI also support prompt caching?


I'm sure internally they use it for the system prompt at least, probably since launch. And maybe for common initial user queries that exactly match.


They are certainly not passing the savings on to the users.


Yet. I suspect OpenAI will release a similar offering soon. (hooray, free market competition!)


That $100 billion data center has to get paid for somehow.


Not currently.


You actually can cache the "output" of a transformer on the prefix by caching what happens in the attention layer for that text string (specifically the "K" and "V" tensors). Since the attention layer is a big part of the compute cost of the transformer, this does cut down FLOPs dramatically.


Oh interesting, didn't know. How does this work past the first transformer block in the stack?


My understanding is that the attention in all transformer layers is "causal", that is, the output of a transformer layer for token N depends only on tokens 0 through N.

This means that every attention layer can reuse previously calculated outputs for the same prompt prefix. So it only needs to calculate from scratch starting at the first token that differs from the cached prefix.
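
A tiny self-contained sketch of that property (plain PyTorch, single head, made-up sizes): with a causal mask, the attention output for a position is the same whether or not later tokens are present, which is exactly what makes the prefix reusable.

  import torch

  torch.manual_seed(0)
  d = 16
  q, k, v = (torch.randn(1, 10, d) for _ in range(3))  # 10 tokens, one head

  def causal_attention(q, k, v):
      n = q.shape[1]
      scores = q @ k.transpose(-1, -2) / d ** 0.5
      mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)  # hide the future
      return scores.masked_fill(mask, float("-inf")).softmax(-1) @ v

  full = causal_attention(q, k, v)                         # all 10 tokens at once
  prefix = causal_attention(q[:, :6], k[:, :6], v[:, :6])  # only the first 6
  print(torch.allclose(full[:, :6], prefix))               # True: prefix outputs unchanged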


I had the same question... my guess is you can do a layer-by-layer cache, i.e. a cache for the first layer, then another independent cache for the second layer, and so on.


The transformer only looks backwards, so if the first part of the sequence (the prompt) doesn't change, you don't need to rerun it on that part, just on the part after it that changed. For use cases where the prompt is large relative to the output (e.g. lots of examples in the prompt), this can significantly speed up the workload.


I think most architectures do a layer of normalization on the whole text embeddings before calculating attention, which makes this infeasible.

Shouldn’t be a huge deal to adjust imo

One of the bigger problems is that closed model providers don't want to expose the embedding space and let users see what they have.


I don't think the normalization makes it infeasible. They should be able to make an adjustment (the reverse of the normalization) in one operation. I think they are caching the attention calcs.

The hard thing (I think) is deciding what to keep in the cache and where to keep it, given that you are serving lots of customers and the attention calc becomes a large set of numbers pretty quickly.


It’s only “hard” because they don’t want to let customers supply the cache of course


I'm not sure that's it. Presumably they want to keep the cache in GPU memory?


That's largely it, IMO. If you could get the embedding representations, you could just recreate the model logic, and then it's no longer closed source.


No, you couldn't. If they just handed you the embedding layer, it wouldn't help that much.

Plus, I'm reasonably certain they are caching the attention scores anyway.


They cache the results of the attention calc. For certain common subsets this makes a lot of sense. I'm surprised they can make it work, though, given they are serving so many different users. Someone somewhere did some very clever engineering.


Why not? It's caching the state of the model after the cached prefix, so that inference workload doesn't need to be run again.


Setting aside efficiency or accuracy, caching enhances the value of prompt engineering and thus increases the effective value of AI services (how the value is monetized or split is TBD).

Comments suggest that caching the state of the network might also reduce processing.

I wonder if it also permits better A/B-style testing by reducing the effect of cross-domain errors. If the AI service providers made it easy to provide feedback on post-cache responses, they could incorporate a quality-enhancement loop, accelerating time to product-market fit (at the risk of increasing dependency and reducing the ability to switch).


This feature was first introduced by DeepSeek, and DeepSeek will just do it automatically for you. https://platform.deepseek.com/api-docs/news/news0802/


Yep. All the LLM providers fast-follow (a.k.a. copy) each other within months, though.


This is great news. I'm using Claude to build a new SaaS [1], and this will likely save me quite a bit on API costs.

[1] https://x.com/codewithparrot


I guess they got tired of losing customers to DeepSeek. DeepSeek introduced this feature a while ago, and their prices were already minuscule given that they only have to compute 20B active parameters.


I just tried Claude the other day. What a breath of fresh air after fighting the dogshit that is OpenAI.

Far less "in the realm of", "in today's fast-moving...", "multifaceted", "delve", or other pretentious wank.

There is still some though, so they obviously used the same kind of dataset that's overweighted with academic papers. Still, I'm hopeful I can finally get it to write stuff that doesn't sound like AI garbage.

Kind of weird there's no moderation API though. Will they just cut me off if my customers try to write about things they don't like?


> Will they just cut me off if my customers try to write about things they don't like?

The response you get back will have a refusal, which is pretty standard.


You can try AWS Bedrock or Openrouter if that happens. They both have the Claude API.


Prompt + LoRA? Train an adapter?


Will this be making its way to Bedrock?


How is the Anthropic stuff on Bedrock in general? We're using OpenAI stuff on Azure right now and it's frustrating how slowly stuff gets rolled out in our region(s).



Sounds kinda useless, TBH. This sounds as if it assumes the exact same context window across requests. If so, given the 5-minute window, unless, for example, your entire team is operating in the same codebase at the same time, you won't really see any savings beyond simple prompts.

Are contexts included in the prompt cache? Are they identified as the same or not? What happens if we approach the 10k token range? 128k? 1M?


Here is a simple use-case that comes to mind. Let's say you have a medium-size repository, such that all the source files can fit in the context. You want to use Claude as an advanced auto-complete in a certain file, but it's important that it has visibility to other files in the repo.

You can put all the other source files in an initial system message and the current file in the user message. Then, if you call multiple autocompletes within 5 minutes of each other, you pay a drastically reduced price for including all of the other files in the context. Also, the latency is much reduced!

Yes, you could probably get a similar outcome by incorporating RAG, search tools, etc. into the autocomplete, but as a simple approach with fewer moving parts, caching will reduce costs for this setup.
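
Roughly, with the Anthropic Python SDK, that setup looks something like the sketch below (the variable contents are placeholders, and the beta header is the one from the launch docs, so double-check against the current documentation):

  import anthropic

  client = anthropic.Anthropic()
  other_source_files = "...concatenated contents of the rest of the repo..."
  current_file = "...the file being edited, up to the cursor..."

  response = client.messages.create(
      model="claude-3-5-sonnet-20240620",
      max_tokens=256,
      extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
      system=[{
          "type": "text",
          "text": other_source_files,
          # Marks this block as a cache breakpoint (~5 min TTL, refreshed on each hit).
          "cache_control": {"type": "ephemeral"},
      }],
      messages=[{"role": "user", "content": "Complete this file:\n" + current_file}],
  )
  print(response.content[0].text)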


I have a system prompt of around 100k tokens, which includes a couple of internal schema definitions. Having this cached could be super useful to us.


100k … I hear about and use all of these high-context LLMs, but they often fail to take into account all the information that should be in their context.

Does your approach work for you?


Would love it if you could de-identify and share a bit more detail on a prompt that big.


It’s useful to consider the impact this has on conversations.

In an ongoing conversation with a model you end up re-submitting the full text of the previous conversation, both prompts and responses, at every step. This means the cost per prompt in that conversation increases each time.

Claude prompt caching can start saving you money even with just a single user having a conversation, provided each of their replies is within five minutes of the previous reply.
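
A rough back-of-the-envelope for that case, using the published Claude 3.5 Sonnet input rates ($3/MTok base, 1.25x for cache writes, 0.1x for cache reads); the conversation shape here is made up:

  BASE = 3.00 / 1_000_000             # $ per uncached input token
  WRITE, READ = 1.25 * BASE, 0.10 * BASE

  turns, tokens_per_turn = 20, 2_000  # hypothetical chat: 20 turns, ~2k new tokens each

  # Without caching, every turn re-sends the whole history at the base rate.
  uncached = sum(BASE * tokens_per_turn * t for t in range(1, turns + 1))

  # With caching, each turn writes only its new tokens and reads the rest from cache.
  cached = sum(WRITE * tokens_per_turn + READ * tokens_per_turn * (t - 1)
               for t in range(1, turns + 1))

  print(f"input cost without cache: ${uncached:.2f}, with cache: ${cached:.2f}")
  # roughly $1.26 vs $0.26 for this made-up conversation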


It says the timeout is refreshed if there's a cache hit within the 5 minutes, and reading from the cache costs 10% of what those input tokens would cost otherwise. Seems pretty damn useful to me. What seems useless to you exactly?

I'm primarily limited by how much context I need for my queries, and most of the time the context is largely the same across multiple queries over periods of 1-60 minutes. This is the case whether it's a codebase I'm working with or a PDF (or other text documentation).

Simple queries are where I expect there to be the least gain for this kind of thing.


Think beyond AI coding assistants: JSON schema definitions, FAQs, product manuals, game instructions, game state, querying a student's thesis... anything where users ask a chatbot about something specific.


> it assumes the exact same context window across requests

That is not true: caching works across multiple requests, which is why it's so good. You can make 5 different concurrent requests and they'll all get cached, and served from the cache, as long as it's still warm for them.


The documentation is here and has better examples: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...

tl;dr: you can cache system prompts, tools, and user messages (up to 4 cache breakpoints total), with better returns on massive inputs such as documents.

The use case is more for client-facing applications that would hit the cache frequently rather than internal copilots.



