There is a moment in the life of every production LLM system where someone looks at the invoice, then at the traffic logs, and asks: “Are we paying to answer the same question twice?”
The answer is almost always yes. And the answer to “by how much” determines whether caching is a minor optimization or a structural rewrite of the cost model.
LLM caching is the most leveraged cost lever in production AI. Anthropic and OpenAI prompt caches now offer up to a 90% discount on cached input tokens. Semantic caches, when tuned correctly, eliminate 30–70% of LLM calls outright. And both are easy to get wrong in ways that quietly serve stale, leaked, or incorrect answers to real users.
This guide is the production playbook: the three layers of LLM caching, the hit-rate economics that decide whether each one pays off, and the invalidation pitfalls that trip up teams the first time they deploy a cache to real traffic. It is one piece of the system underneath the chat box, and part of the larger question of why your AI experiments are failing once they scale beyond a demo.
The Three Layers of LLM Caching
Most engineering teams reach for “a cache” as if it is one thing. In production LLM systems it is three things, operating at different layers, with different hit-rate profiles, different invalidation rules, and different failure modes.
| Layer | What it caches | Where it lives | Typical savings |
|---|---|---|---|
| Exact-match cache | Identical prompt → identical response | Your infrastructure (Redis, Memcached, in-process) | 100% on hits — no model call at all |
| Semantic cache | Similar prompt → reuse of past response | Your infrastructure (vector store + embedding model) | 100% on hits — no model call at all |
| Provider prompt cache (KV cache) | Reused prefix of a single prompt | Provider infrastructure (Anthropic, OpenAI, Google) | ~90% off cached input tokens |
These layers stack. A well-architected system checks the exact-match cache first, then the semantic cache, and only on a miss does it hit the model — where the provider’s prompt cache then discounts the long, repeated prefix portion of the request. Each layer has its own design and its own way to fail.
Layer 1: Exact-match caching
The simplest layer. Hash the full prompt (including system message, conversation history, retrieved context, and any parameters that affect output — temperature, top_p, model, seed) and store the response under that key.
Where it wins: high-traffic systems with repeating queries. Status check bots, FAQ answerers, internal tools where ten users a day ask “what’s the deploy command for prod?” Hit rates in the 20–40% range are common in chat assistants and customer-facing FAQs.
Where it loses: anything with high prompt variability. If every prompt includes a timestamp, user ID, or session token in the system message, every hash is unique and the cache hit rate is zero. The fix is to normalize: hash only the parts of the prompt that actually determine the answer, not the entire payload.
The infrastructure is mature and boring. Redis or Memcached with a TTL is fine. The interesting design choices are: what is the cache key, what is the TTL, and what triggers invalidation.
Layer 2: Semantic caching
Semantic caching is the layer most teams get wrong on the first deploy.
The idea: embed the incoming prompt, search a vector store for past prompts whose embeddings are within a cosine similarity threshold, and return the cached response if a match is found. Two prompts do not have to be byte-identical — they just have to mean approximately the same thing.
The mechanics, as documented across production deployments: client → embed query (~3–8ms) → vector search → cache hit returns immediately, cache miss calls the LLM and stores the new pair. Tools commonly in use include GPTCache, Redis with vector search, and managed semantic caches inside gateways like Portkey.
The economics are real. Published research on GPT Semantic Cache reports cache hit rates of 61.6% to 68.8% across query categories, with API call reductions up to 68.8%. Other production deployments report 30–70% depending on traffic mix. Industry guides cite ~60% API bill reductions as a realistic top end when threshold tuning is done well.
But semantic caching introduces three failure modes that exact-match caching does not have:
- False positives. Two prompts that are semantically similar but functionally different. “What is the refund policy?” and “What is the cancellation policy?” share embedding space but may have different answers. A loose threshold serves the wrong answer with full confidence.
- Latency overhead. As production engineers note, vector search on a remote Redis instance adds 20–50ms per request, and at p99 you might see 100ms. That overhead only pays off if your hit rate is above 15–20%. Below that, semantic caching is a net latency loss.
- Cold-start uselessness. A semantic cache with no entries is just overhead. Hit rate climbs with traffic; it does not start there.
The threshold is everything. Most teams start with conservative settings (cosine similarity ~0.95+) on first deploy, accept a 5–10% hit rate with a false positive rate under 0.5%, then tune downward as confidence grows. The wrong move is to ship at 0.85 to chase a 60% hit rate on day one — that is how you wake up to support tickets about confidently wrong answers.
Semantic caching is not safe by default
A semantic cache will happily return a refund policy when the user asked about cancellation if the threshold is too loose. Treat it like any other ML system: start conservative, instrument the false-positive rate explicitly, and only loosen the threshold when you have data showing it stays safe. Never deploy a semantic cache without a way to disable it instantly.
Layer 3: Provider prompt caching (and KV cache)
This is the newest layer and the most overlooked. Anthropic, OpenAI, and Google all now offer prompt caching at the API level — discounting reused prefixes of your prompts at the inference level.
The mechanism is the model’s KV cache, exposed as a feature. When you send a long prompt — a multi-page system instruction, a retrieved document, an extended conversation history — the provider computes the attention key-value tensors for that prefix and can reuse them on subsequent requests that share the same prefix. You pay a small premium to write the cache and a large discount to read it.
Anthropic’s prompt caching pricing makes the math concrete:
| Operation | Cost vs. base input |
|---|---|
| Cache write (5-minute TTL) | 1.25× |
| Cache write (1-hour TTL) | 2× |
| Cache read | 0.1× (i.e., 90% off) |
So a cached input token costs 10% of the standard rate. The break-even is one cache read for the 5-minute tier and two reads for the 1-hour tier. Production guidance from 2026 reports 85–90% reduction on cached input across teams that put it into reused system prompts and document contexts. Importantly, the KV cache representations themselves are held in memory only, not stored at rest, with a 5-minute (standard) or 1-hour (extended) lifetime.
Where it wins: long, stable prefixes. RAG pipelines where the same retrieved chunk shows up across many requests. Agent loops where the same system prompt and tool definitions are sent every turn. Multi-turn conversations where the early turns repeat across follow-up requests.
Where it loses: highly variable prompts where there is no shared prefix. Single-turn classifications. Anything with the variable content at the start instead of the end.
The implication for prompt design is concrete: put the stable, repeated content (system prompt, tool definitions, retrieved documents) at the beginning of the prompt, and put the variable content (user message, current turn) at the end. This is unintuitive if your prior experience is with traditional caches but it is how transformer KV caches work — they cache from the left.
We cover provider-side pricing mechanics in depth in our breakdowns of Anthropic API pricing and the true cost of Google Gemini. Prompt caching is the largest single line-item discount available in either pricing model. If you are spending five figures a month on LLMs and not using it, that is where to start.
Hit-Rate Economics: When Each Layer Pays Off
Caching is an engineering investment. The question is when it actually pays back.
The break-even point depends on three things: your cost per LLM call, the overhead cost of the cache (latency, infrastructure, engineering time), and your hit rate. A useful mental model:
- Exact-match cache pays off above ~5% hit rate. Overhead is negligible — a Redis GET. There is almost no scenario where exact-match caching is a net loss if you have any repeating queries at all.
- Semantic cache pays off above ~15–20% hit rate, because the embedding + vector search overhead is real (20–50ms p50, up to 100ms p99). Below that threshold, you are adding latency to every miss for marginal savings.
- Provider prompt cache pays off after the first or second read on the same cached prefix, depending on TTL. The break-even is request-level, not aggregate.
The cost ratio across the layers also matters. If your traffic is dominated by long system prompts and short user messages — typical of RAG and agentic systems — the provider prompt cache is the highest-impact layer. If your traffic is repetitive customer questions where the whole prompt is similar to a past prompt, semantic caching dominates.
In a well-instrumented system, all three layers contribute, and you watch their hit rates as independent metrics. Aggregating them hides the layer that is silently broken.
Cache Invalidation: Where Production Caches Go to Die
Phil Karlton’s law applies double to LLM caches. There are only two hard problems: cache invalidation, naming things, and off-by-one errors. The LLM-specific failure modes:
Stale answers after a source-of-truth change. A user asks “what is your return policy?” — answer cached. Marketing updates the return policy on the website. The cache keeps serving the old answer for the rest of the TTL. The user reads it and ships back a product you no longer accept returns on. The fix: invalidate by source. When a retrieved document is updated, evict all cached entries that touched it. This means your cache keys (or a secondary index) need to track which sources contributed to each cached answer.
Stale answers after a prompt change. You ship a new system prompt that changes tone, output schema, or behavior. The cache has thousands of entries generated under the old system prompt. New traffic gets cache hits that look nothing like what your new prompt is producing. The fix: include a hash of the system prompt and prompt version in the cache key. A prompt change automatically invalidates the cache.
Stale answers after a model change. Same problem, different cause. You move from GPT-5 to GPT-5.1 (or your routing layer does it for you). Cached responses were generated by the old model. The fix: include the model snapshot in the cache key. We cover the broader model versioning discipline — including why you should pin to dated snapshots, not aliases — in our LLM routing production guide.
Tenant leakage. Two tenants ask functionally identical questions. The semantic cache happily serves Tenant A’s cached answer to Tenant B — including any tenant-specific data baked into the original answer. The fix: tenant ID is always part of the cache key for multi-tenant systems. Always.
PII in cache values. A user provides their email or phone number in a query. The response references that PII. The full response gets cached. Six hours later, a different user with a semantically similar query gets a cache hit and reads someone else’s contact info. The fix: redact PII from prompts before they hit the cache, or do not cache responses to PII-containing prompts at all.
The cache key is the security boundary
For multi-tenant systems, the cache key is a security primitive. Tenant ID, role, and PII redaction policy all belong in it. A semantic cache that does not scope by tenant is a cross-tenant data leak waiting to happen — it just has not happened to you yet.
A Production Caching Architecture
The stack that actually works in production:
[request]
↓
[exact-match cache] ─── hit → return
↓ miss
[semantic cache] ─── hit → return
↓ miss
[router + gateway]
↓
[provider call with prompt caching enabled]
↓
[response] → write to both caches → return
Both internal caches live inside the gateway layer so they are co-located with routing, fallbacks, and rate limiting. The provider’s prompt cache is handled by enabling cache control on the long, stable prefix portions of the prompt — you do not run it; the provider does.
Beyond the caches themselves, the supporting pieces:
- Per-tenant cache scoping is enforced at the key level.
- Cache hit/miss/false-positive metrics flow into your observability stack and your cost attribution pipeline so you can see savings per route and per tenant.
- A kill switch. Every cache layer has a feature flag to disable it instantly. When you ship a new prompt or model and want to rebuild the cache from clean traffic, you flip the flag.
- An invalidation API. Programmatic eviction by source document, by prompt version, by model version, by tenant. Used both during routine updates and during incident response.
This is the same architectural pattern we apply across our Operational AI engagements: cache is treated as a first-class system component with its own observability, its own kill switch, and its own ownership — not a Redis instance someone set up in a hackathon.
What to Measure
The cache is invisible until something goes wrong, at which point you need every signal it has. The minimum metrics:
- Hit rate per layer — exact, semantic, provider — tracked separately. An aggregate hit rate hides which layer is failing.
- False-positive rate on the semantic cache — sampled human or LLM-judge review of cached responses vs. fresh responses.
- Cache-induced latency — p50 and p99 latency on cache hit and miss paths, separately. Tells you whether the cache is paying back.
- Cost saved per route — the dollar-value version of hit rate. The number you put in front of finance to justify the engineering time.
- Eviction rate and reason — TTL expiry, manual invalidation, source change, prompt version change.
- Cache size and growth — for capacity planning.
These tie directly to the broader AI cost optimization story. Caching is the most leveraged single tool in that story, but only if you can prove it is working.
Make Caching a Structural Cost Control, Not an Afterthought
If your LLM bill is climbing faster than your traffic, caching is almost certainly under-deployed. Our engineers can help architect a three-layer cache, tune it against your actual traffic, and instrument it so the savings show up on your dashboard, not just in your invoice.
The Decision Framework
A pragmatic order of operations for teams that have not yet built their caching layer:
- Turn on provider prompt caching first. It is the cheapest engineering work, the largest single discount, and it is safe — the provider handles invalidation. Restructure long, repeating prompts to put stable content at the front.
- Add exact-match caching second. A Redis instance and a normalized cache key. Low risk, immediate wins on any traffic with repeating queries.
- Add semantic caching last, and conservatively. Start with a tight threshold. Instrument false positives. Loosen the threshold only when you have data showing it is safe. Always scope by tenant.
- Wire everything into observability and cost attribution. A cache you cannot measure is a cache you cannot trust.
The teams that get this right report 60–90% cost reductions on cached portions of their workload and meaningful latency improvements on the hit path. The teams that ship semantic caching at a loose threshold on day one end up rolling it back after the first cross-tenant or stale-data incident.
This is one layer of the system underneath the chat box — the gap between an impressive demo and production AI. Routing, rate limiting, cost attribution, and caching are the four cost-control levers; caching is the largest of them. It pairs especially closely with LLM routing, since cache hits skip the routing decision entirely, and with LLM cost attribution, which is how you prove the savings to the people writing the checks.
Frequently Asked Questions About LLM Caching
What is the difference between exact-match, semantic, and prompt caching?
Exact-match caching returns a stored response when an incoming prompt is byte-identical to a past prompt. Semantic caching returns a stored response when an incoming prompt is similar in meaning to a past prompt, using vector embeddings and a similarity threshold. Provider prompt caching (sometimes called KV caching) is a feature offered by Anthropic, OpenAI, and Google that discounts the reused prefix portion of a prompt at the inference layer — typically around 90% off on cached input tokens. The three layers stack: exact-match first, then semantic, then provider prompt caching on the miss path.
How much can I actually save with LLM caching?
Realistic ranges: exact-match caches typically eliminate 20-40% of LLM calls in high-traffic chat or FAQ systems. Semantic caches eliminate an additional 30-70% depending on traffic mix and threshold tuning. Provider prompt caching discounts cached input tokens by approximately 90% — a 5-10x reduction on the long, stable parts of your prompts. Combined, production teams routinely report 60-90% cost reductions on cached portions of their workload. The catch is hit rate: a semantic cache with a 5% hit rate is a net cost in latency.
Is semantic caching safe to use in production?
It can be, but it is not safe by default. Semantic caches have a real false-positive rate — two prompts that are similar in embedding space but functionally different (refund vs. cancellation policy, for instance) can serve the wrong cached answer with full confidence. Safe deployment means starting with a tight similarity threshold (around 0.95 cosine), explicitly measuring the false-positive rate with sampled review, scoping cache keys by tenant for multi-tenant systems, and having a kill switch to disable the cache instantly. Loose thresholds in pursuit of high hit rates are how you ship a cross-tenant or stale-data incident.
How does Anthropic prompt caching pricing work?
Per Anthropic's documentation, cache writes cost 1.25 times the base input token price for the 5-minute TTL and 2 times for the 1-hour TTL. Cache reads cost 0.1 times the base input token price — a 90% discount. The break-even is one cache read for the 5-minute tier and two reads for the 1-hour tier. KV cache representations are held in memory only and are not stored at rest. The practical implication is to put stable content (system prompts, tool definitions, retrieved documents) at the start of the prompt, since the cache works left-to-right.
When should I NOT use caching?
A few cases. When responses must be unique per request (real-time stock prices, personalized content generation that must not repeat). When prompts are dominated by variable content with no repeating prefix — provider prompt caching has nothing to cache. When traffic is too low to amortize the cache infrastructure cost. When responses contain PII that cannot safely be reused across users — in this case, redact before caching or do not cache. And when you are in the first week of a new prompt or model and want clean traffic to evaluate against, you turn caching off behind a feature flag.
How do I invalidate the LLM cache when my source data changes?
Build invalidation into the cache key from day one. The cache key should include: the prompt template hash (changes invalidate on prompt updates), the model snapshot ID (changes invalidate on model updates), the tenant ID (prevents cross-tenant hits), and a list of source document IDs that contributed to the response (allows targeted eviction when a source updates). On a source update, your invalidation API evicts all cached entries that reference that source. Without this design, your only invalidation tool is TTL expiry, which means stale answers continue serving until the timer runs out.