That pricing is ridiculous. A token is essentially a 32-bit integer. Four bytes. A million tokens is 4MB. Imagine paying $1/hr for less than three floppies' worth of storage. That's two million times more expensive than standard S3 storage (256M tokens ≈ 1GB, so 256 × $1/hr × 720 hours ≈ $184k per GB-month, vs $0.09). Or 2,000 times more expensive than the storage cost of ElastiCache Serverless.
(Yes, I realize it's probably more than 4MB, but it's still an outrageously high markup. They could do their own caching, not tell you they're doing it, and pocket the difference to make even more money.)
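To make the arithmetic explicit, here's the back-of-the-envelope behind those ratios. The only inputs are the figures already quoted in this thread: 4 bytes per token, the $1 per million tokens per hour cache price, $0.09/GB-month for S3, and $0.125/GB-hour for ElastiCache Serverless.

```python
# Back-of-the-envelope comparison, using the figures quoted above.
tokens_per_gb = 256_000_000                                # 256M tokens x 4 bytes/token ~= 1 GB
cache_per_gb_month = (tokens_per_gb / 1e6) * 1.00 * 720    # $1 per M tokens per hour x 720 h/month
s3_per_gb_month = 0.09
elasticache_per_gb_month = 0.125 * 720                     # $0.125/GB-hour serverless

print(f"cache: ${cache_per_gb_month:,.0f}/GB-month")                             # $184,320
print(f"vs S3: {cache_per_gb_month / s3_per_gb_month:,.0f}x")                    # ~2,048,000x
print(f"vs ElastiCache: {cache_per_gb_month / elasticache_per_gb_month:,.0f}x")  # ~2,048x
```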
"RAM near a GPU" is ~the same cost as "RAM near literally any other piece of hardware". Even if it has to traverse the network, that's a fairly low, fairly fixed (in, say, the same rack) cost. Hell, it's probably even fast enough to use an NVME disk.
Google can search the entire Internet in a fraction of a second; they can keep a million tokens within a few dozen milliseconds of a GPU for less than a dollar an hour.
Is that fast enough? And how much data is being stored? They're not storing the tokens you pass in but the activations after processing them. I'll take a wild stab that the activations for Claude 3.5 aren't anywhere near 4MB.
You've already frowned on my comparison to S3, but I think it's apt: this is many times more expensive than S3, yet even for a gigabyte of activations it doesn't need to be two orders of magnitude faster than standard S3.
If you use the ElastiCache pricing, which is $0.125/GB per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data, it's still almost an order of magnitude more expensive than an in-memory cache adjacent to the inference boxes.
When your managed cache costs eight times as much as a general-purpose managed cache _in the cloud_, you've jumped the shark on pricing.
I do not understand your first paragraph. It's more expensive than S3 and it's entirely different. It's cheaper than writing on vellum, and faster!
> If you use the Elasticache pricing, which is $0.125/gb per hour, it's still eight times more expensive. So even if a million tokens is a full gigabyte of data
Is it a gigabyte of data and is it fast enough?
You've guessed 4MB and 1GB. What's the actual data size here? What speed do you need to get it into GPU RAM?
The entire point here is to lower latency and costs so it has to be close and fast.
As I pointed out, even if it's a gig of data that's still almost an order of magnitude more than the cost of a managed in-memory cache in the cloud. That's wild.
It’s probably more. Pretty conservatively, if the KV embedding dimension for each token is ~10K and there are ~100 attention layers (roughly the scale of Llama 3.1 405B), that’s already 1M 16-bit floats per token = 2MB. They have likely needed to implement some kind of KV compression (like DeepSeek's) to make this feasible at all.
No, in a transformer a token is a vector; for larger models that's probably something like 6k-12k floats. Assuming 8-bit precision, a token is more like 6-12kB.
So assuming 100k tokens, you end up with 554MB for the input tokens ALONE.
Depending on the model architecture the memory will vary, but from my observation the runtime memory increase is at least on the same order of magnitude as the initial memory used to load the model, and that's for a moderate context length (<32k); it grows linearly from there, even before counting the n×n attention matrices.
So you are easily looking at caching 10-100GB of data in a very hot state, and that is going to be very expensive indeed.
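If anyone wants to sanity-check these sizes, the standard KV-cache formula is easy to compute. The shapes below are assumptions (Claude's architecture isn't public): one dense-attention shape close to the ~2MB/token estimate above, and one Llama-3.1-405B-like grouped-query-attention shape.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Dense MHA, ~10K KV values per layer (K+V combined), 100 layers, fp16 -> ~2 MB/token.
dense = kv_cache_bytes_per_token(n_layers=100, n_kv_heads=40, head_dim=128)
# GQA, Llama-3.1-405B-like shape (126 layers, 8 KV heads of dim 128), fp16 -> ~0.5 MB/token.
gqa = kv_cache_bytes_per_token(n_layers=126, n_kv_heads=8, head_dim=128)

for name, per_token in [("dense MHA", dense), ("GQA", gqa)]:
    print(f"{name}: {per_token / 1e6:.2f} MB/token, "
          f"{per_token * 100_000 / 1e9:.0f} GB for a 100k-token prompt")
```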
It's not. You're confused and don't understand what you're talking about. Just read all the replies you've already gotten and try to understand why you're wrong instead of doubling down on an incorrect take.
Why does prompt caching reduce costs? I'm assuming that the primary cost driver is GPU/TPU FLOPS, as opposed to any network / storage / etc costs.
My understanding is that an LLM will take in the stream of text, tokenize it (can be faster with caching, sure, but it's a minor drop in the bucket), then run a transformer on the entire sequence. You can't just cache the output of a transformer on a prefix to reduce workload.
Prompt caching has been a thing for LLMs since GPT-2 (e.g. transformers' `use_cache=True` / `past_key_values`); it's more of a surprise that it took this long for the main LLM providers to provide a good implementation.
You actually can cache the "output" of a transformer on the prefix by caching what happens in the attention layer for that text string (specifically the "K" and "V" tensors). Since the attention layer is a big part of the compute cost of the transformer, this does cut down FLOPs dramatically.
My understanding is that the attention in all transformer layers is "causal": the output of a transformer layer for token N depends only on tokens 0 through N.
This means that every attention layer can use previously calculated outputs for the same prompt prefix. So it only needs to calculate from scratch starting from the first unique token in the prompt sequence.
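This is what the `past_key_values` / `use_cache` mechanism in Hugging Face transformers implements. A minimal sketch (gpt2 is used purely as a small stand-in model, and the prompt strings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Run the shared prefix once and keep the per-layer K/V tensors.
prefix = tok("A long shared system prompt would go here.", return_tensors="pt")
with torch.no_grad():
    out = model(**prefix, use_cache=True)
kv_cache = out.past_key_values          # this is the "prompt cache"

# A later request with the same prefix only runs the new suffix tokens;
# attention for the suffix reads the cached K/V instead of recomputing them.
suffix = tok(" And here is the new user question.", return_tensors="pt")
attn = torch.cat([prefix.attention_mask, suffix.attention_mask], dim=-1)
with torch.no_grad():
    out2 = model(input_ids=suffix.input_ids,
                 attention_mask=attn,
                 past_key_values=kv_cache,
                 use_cache=True)
next_token = out2.logits[:, -1].argmax(-1)
print(tok.decode(next_token))
```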
I had the same question... my guess is you can do a layer-by-layer cache, i.e. a cache in the first layer, then another independent second-layer cache, and so on.
The transformer only looks backwards, so if the first part of the sequence (the prompt) doesn't change, you don't need to rerun it again on that part, just on the part after it that changed. For use cases with large prompts relative to the output size (e.g. lots of examples in the prompt), this can significantly speed up the workload.
I don't think the normalization makes it infeasible. They should be able to make an adjustment (the reverse of the normalization) in one operation. I think they are caching the attention calcs.
The hard thing (I think) is what to keep in the cache and where to keep it, given you are serving lots of customers and the attention calc can become a large set of numbers pretty quickly.
They cache the results of the attention calc. For certain subsets which are common this makes a lot of sense. I'm surprised they can make it work though, given they are serving so many different users. Someone somewhere did some very clever engineering.
Setting aside efficiency or accuracy, caching enhances the value of prompt engineering and thus increases the effective value of AI services (how the value is monetized or split is TBD).
Comments suggest that caching the state of the network might also reduce processing.
I wonder if it also permits better A/B-style testing by reducing the effect of cross-domain errors. If the AI service providers made it easy to give feedback on post-cache responses, they could close the quality-improvement loop and accelerate time to product-market fit (at the risk of increasing dependency and reducing the ability to switch providers).
I guess they got tired of losing customers to DeepSeek. DeepSeek introduced this feature a while ago, and their prices were already minuscule given that they only have to compute ~20B active parameters.
I just tried Claude the other day. What a breath of fresh air after fighting the dogshit that is OpenAI.
Far less "in the realm of", "in today's fast-moving...", multifaceted, delve or other pretentious wank.
There is still some though so they obviously used the same dataset that's overweight in academic papers. Still, I'm hopeful I can finally get it to write stuff that doesn't sound like AI garbage.
Kind of weird there's no moderation API though. Will they just cut me off if my customers try to write about things they don't like?
How is the Anthropic stuff on Bedrock in general? We're using OpenAI stuff on Azure right now and it's frustrating how slowly stuff gets rolled out in our region(s).
Sounds kinda useless, TBH. It sounds as if it assumes the exact same context window across requests. If so, given the 5-minute window, unless for example your entire team is operating in the same codebase at the same time, you won't really see any savings beyond simple prompts.
Are contexts included in the prompt cache? Are they identified as the same or not? What happens if we approach the 10k token range? 128k? 1M?
Here is a simple use-case that comes to mind.
Let's say you have a medium-size repository, such that all the source files can fit in the context. You want to use Claude as an advanced auto-complete in a certain file, but it's important that it has visibility into the other files in the repo.
You can put all the other source files in an initial system message and the current file in the user message. Then, if you call multiple autocompletes within 5 minutes of each other, you pay a drastically reduced price for including all of the other files in the context. Also, the latency is much reduced!
Yes, you could probably get a similar outcome by incorporating RAG, search tools, etc. into the autocomplete, but as a simple approach with fewer moving parts, the caching will reduce costs for this setup.
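For reference, a rough sketch of what that looks like with the Anthropic Python SDK: you mark the big, stable block with `cache_control` and keep the per-request part in the user message. `repo_context` and `current_file` are placeholders here, and the beta header may or may not be needed depending on your SDK version.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def autocomplete(repo_context: str, current_file: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=[
            {
                "type": "text",
                "text": "You are an autocomplete engine. Continue the current file.",
            },
            {
                "type": "text",
                "text": repo_context,                    # all the other source files
                "cache_control": {"type": "ephemeral"},  # cache everything up to here
            },
        ],
        messages=[{"role": "user", "content": current_file}],
    )
    return response.content[0].text
```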
It’s useful to consider the impact this has on conversations.
In an ongoing conversation with a model you end up re-submitting the full text of the previous conversation - both prompts and responses - at every step. This means the cost per prompt in that conversation increases each time.
Claude prompt caching can start saving you money even with just a single user having a conversation, provided each of their replies is within five minutes of the previous reply.
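A rough model of that, using Anthropic's published multipliers (cache writes at 1.25x the base input price, cache reads at 0.1x); the per-turn token counts and the $3/M base input price are made-up example numbers:

```python
# Hypothetical conversation: each turn adds 500 new tokens (prompt + response)
# and re-sends the entire history as input on the next turn.
base = 3.00 / 1e6            # e.g. $3 per million input tokens
turns, new_per_turn = 20, 500

uncached = cached = 0.0
history = 0
for _ in range(turns):
    uncached += (history + new_per_turn) * base
    # With caching: the history is a cache read (0.1x), the new tokens a cache write (1.25x).
    cached += (history * 0.10 + new_per_turn * 1.25) * base
    history += new_per_turn

print(f"without caching: ${uncached:.4f}, with caching: ${cached:.4f}")
```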
It says the timeout is refreshed if there's a cache hit within the 5 minutes, and the base cost is 10% of what it would be otherwise. Seems pretty damn useful to me. What seems useless to you exactly?
I'm primarily limited by how much context I need for my queries, and for the majority of the time, the context can often largely be the same across multiple queries over periods of 1-60 minutes. This is the case whether it's a codebase I'm working with or a PDF (or other form of text documentation).
Simple queries are where I expect there to be the least gain for this kind of thing.
Think beyond AI coding assistants. JSON schema definitions, FAQs, product manuals, game instructions, game state, querying a student's thesis... anything where users query a chatbot for information about something specific.
>it assumes the exact same context window across requests
That is not true; caching works across multiple requests, which is why it's so good. You can make 5 different concurrent requests and they'll all write to and read from the cache as long as it's still warm for them.
https://ai.google.dev/gemini-api/docs/caching?lang=python
Costs $1 / 1M tokens / hour.
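The flow described on that page looks roughly like this with the google-generativeai Python SDK (a sketch, not a verified snippet; `big_document_text` is a placeholder and the exact model versions that support caching are listed in the linked docs):

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="...")

# Cache the large, stable context once; it is billed per token-hour while it lives.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",
    system_instruction="Answer questions about the attached document.",
    contents=[big_document_text],
    ttl=datetime.timedelta(hours=1),
)

# Subsequent requests reference the cache instead of re-sending the context.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("Summarize section 3.")
print(response.text)
```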