LLM Prompt Caching: Cut Costs and Latency by 90%
Prompt caching is the single biggest optimization for LLM applications with shared context. This guide covers how it works across providers, implementation patterns, and the tradeoffs.
If your LLM application sends the same system prompt, few-shot examples, or document context with every request, you're paying full price for the same computation over and over. Prompt caching fixes this, and the savings are dramatic.
What Prompt Caching Does
When you send a prompt to an LLM, the model processes every token in the input (the "prefill" phase) before generating output. For a request with a 10,000-token system prompt and a 100-token user query, 99% of the prefill compute goes to processing context that's identical across requests.
Prompt caching stores the intermediate computation (the KV cache) from the shared prefix, so subsequent requests skip the prefill for cached tokens. The result:
- 50-90% cost reduction on cached tokens (provider-dependent)
- Dramatically lower latency (cached prefill is nearly instant)
- Same output quality (the computation is mathematically identical)
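To make these numbers concrete, here is a minimal sketch of the arithmetic. It assumes a flat per-token input price and a single discount applied to every cached token; the function name and the 90% discount are illustrative and do not represent any provider's exact pricing.

```python
def effective_input_cost(prefix_tokens: int, suffix_tokens: int,
                         cached_discount: float) -> float:
    """Input cost of a cache-hit request relative to the same request uncached.

    Assumes the whole prefix is served from cache and the suffix is billed
    at full price. Returns 1.0 when there is nothing to save.
    """
    total = prefix_tokens + suffix_tokens
    if total == 0:
        return 1.0
    billed = prefix_tokens * (1.0 - cached_discount) + suffix_tokens
    return billed / total


# The example from the text: 10,000-token system prompt, 100-token query.
relative = effective_input_cost(10_000, 100, cached_discount=0.90)
print(f"A cache-hit request costs {relative:.1%} of the uncached input price")
```

With these inputs, a cache hit costs roughly 11% of the uncached input price, which is where the headline figure of about 90% comes from.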
Provider Support
Anthropic (Claude)
Anthropic offers explicit prompt caching with cache breakpoints:
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_system_prompt,  # 10K+ tokens
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    messages=[{"role": "user", "content": "User's actual question"}],
)
- Cached input tokens cost 90% less than regular input tokens
- Cache write costs 25% more than regular input (one-time overhead)
- Cache TTL is ~5 minutes (refreshed on each hit)
- Minimum cacheable prefix: varies by model; at the time of writing, 1,024 tokens for Sonnet and Opus models and 2,048 tokens for Haiku models
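The pricing figures above imply a break-even point, sketched below. The multipliers come from the bullets (writes at 1.25x, reads at 0.10x of the regular input price); the function name is hypothetical, and the sketch assumes every request after the first arrives within the TTL and hits the cache.

```python
def relative_prefix_cost(n_requests: int,
                         write_multiplier: float = 1.25,
                         read_multiplier: float = 0.10) -> float:
    """Cost of sending a shared prefix n times with caching,
    relative to sending it n times without caching.

    Assumes one cache write on the first request and a cache hit
    on every later request (no expiry in between).
    """
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    cached = write_multiplier + (n_requests - 1) * read_multiplier
    return cached / n_requests


for n in (1, 2, 10):
    print(n, round(relative_prefix_cost(n), 3))
```

Under these assumptions a single request costs more than it would uncached, but the write premium is already recovered by the second request.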
OpenAI (GPT)
OpenAI automatically caches prompts without explicit configuration:
from openai import OpenAI

client = OpenAI()

# No special syntax needed: identical prefixes are cached automatically
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": large_system_prompt},
        {"role": "user", "content": user_query},
    ],
)

# Check usage for cache stats
print(response.usage.prompt_tokens_details.cached_tokens)
- Cached tokens cost 50% less
- Automatic β no code changes needed
- Minimum cacheable prefix: 1,024 tokens
- Cache lifetime: varies (minutes to hours based on traffic)
Google (Gemini)
Google offers explicit context caching with longer TTLs:
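from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Create a cached context
cache = client.caches.create(
    model="gemini-2.0-flash",
    config=types.CreateCachedContentConfig(
        contents=[large_document_content],
        ttl="3600s",  # 1 hour
    ),
)

# Use the cache
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="User's question about the document",
    config=types.GenerateContentConfig(cached_content=cache.name),
)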
- Cached tokens cost 75% less
- Explicit TTL control (up to 48 hours)
- Minimum 32,768 tokens for caching
- Storage costs apply for long-lived caches
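Because explicit caches incur storage costs, it can be worth checking that the discount outweighs the storage bill. The sketch below is a rough model only: the function name is hypothetical, the prices in the usage example are placeholders rather than Google's actual rates, and it ignores the cost of creating the cache and assumes every request is a cache hit.

```python
def explicit_cache_net_savings(cached_tokens: int,
                               requests: int,
                               input_price_per_mtok: float,
                               cached_discount: float,
                               storage_price_per_mtok_hour: float,
                               ttl_hours: float) -> float:
    """Discount savings minus storage cost for one cache lifetime.

    Positive means the cache paid for itself. All prices are per
    million tokens and must be supplied by the caller.
    """
    mtok = cached_tokens / 1_000_000
    discount_savings = mtok * input_price_per_mtok * cached_discount * requests
    storage_cost = mtok * storage_price_per_mtok_hour * ttl_hours
    return discount_savings - storage_cost


# Placeholder prices, NOT real rates: substitute the current price sheet.
net = explicit_cache_net_savings(
    cached_tokens=100_000,
    requests=14,
    input_price_per_mtok=0.10,
    cached_discount=0.75,
    storage_price_per_mtok_hour=1.00,
    ttl_hours=1.0,
)
print(f"Net savings for this cache lifetime: {net:+.4f}")
```

The useful output is the sign: if the result is negative for your expected request volume, a shorter TTL or no explicit cache may be cheaper.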
Architecture Patterns
Pattern 1: Shared System Prompt
The simplest case. Your application has one system prompt used across all requests:
[System prompt: 5K tokens] [User message: varies]
 ↑ cached                   ↑ not cached
Every request after the first gets the cached rate. This is the default benefit with OpenAI's automatic caching.
Pattern 2: Document Q&A
A user uploads a document and asks multiple questions:
[System prompt: 1K] [Document: 50K tokens] [Q1: 100 tokens]
[System prompt: 1K] [Document: 50K tokens] [Q2: 150 tokens]
[System prompt: 1K] [Document: 50K tokens] [Q3: 80 tokens]
 ↑ cached after the first request           ↑ only this varies
Without caching: 3 × 51,000 = 153,000 input tokens billed at full price
With caching: 51,000 (first request) + 2 × 51,000 at cached rate + variable tokens
Savings: ~60% on total input costs, assuming a 90% cached-token discount and no cache-write premium
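The arithmetic for this pattern can be checked with a short sketch. The function name and parameters are illustrative; costs are expressed in full-price token equivalents, and the optional write multiplier models providers that charge a premium on the first (cache-writing) request.

```python
from typing import Sequence


def billed_token_equivalents(prefix_tokens: int,
                             question_tokens: Sequence[int],
                             cached_discount: float = 0.90,
                             write_multiplier: float = 1.0) -> float:
    """Total input cost, in full-price token equivalents, for a series of
    questions that share one prefix. The first request writes the cache;
    every later request is assumed to hit it."""
    total = 0.0
    for i, q in enumerate(question_tokens):
        if i == 0:
            total += prefix_tokens * write_multiplier
        else:
            total += prefix_tokens * (1.0 - cached_discount)
        total += q  # the question itself is never cached
    return total


questions = [100, 150, 80]
uncached = 51_000 * len(questions) + sum(questions)
cached = billed_token_equivalents(51_000, questions)
print(f"Savings: {1 - cached / uncached:.0%}")

# Same scenario with a 25% cache-write premium on the first request
with_premium = billed_token_equivalents(51_000, questions, write_multiplier=1.25)
print(f"Savings with write premium: {1 - with_premium / uncached:.0%}")
```

With a 90% discount and no write premium the saving is about 60%; adding a 25% write premium lowers it to roughly 52% over three questions, and the gap narrows as more questions are asked.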
Pattern 3: Multi-Turn Conversation
Each turn includes the full conversation history:
Turn 1: [System] [User1]
Turn 2: [System] [User1] [Asst1] [User2]
Turn 3: [System] [User1] [Asst1] [User2] [Asst2] [User3]
        └────── cached prefix grows ──────┘
The cache grows with each turn. Later turns benefit most.
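A small simulation shows how the cached share grows. This is a sketch under a simplifying assumption: the entire input of each turn is cached and is still warm when the next turn arrives. The function name and token counts are made up for illustration.

```python
from typing import List, Tuple


def cached_vs_new_per_turn(system_tokens: int,
                           turns: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """For each turn, return (cached_input_tokens, new_input_tokens).

    `turns` holds (user_tokens, assistant_tokens) pairs. The cached part of a
    turn is everything that was sent as input on the previous turn.
    """
    results = []
    previous_input = 0       # input size of the previous request
    history = system_tokens  # tokens that precede the next user message
    for user_tokens, assistant_tokens in turns:
        current_input = history + user_tokens
        results.append((previous_input, current_input - previous_input))
        previous_input = current_input
        history = current_input + assistant_tokens
    return results


for turn, (cached, new) in enumerate(
        cached_vs_new_per_turn(1_000, [(100, 200), (50, 300), (80, 150)]), start=1):
    print(f"Turn {turn}: {cached} cached, {new} new")
```

Only the previous assistant reply and the new user message are billed at full price on each turn; everything before them is read from cache.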
Pattern 4: Few-Shot Examples
Applications with shared few-shot examples in the prompt:
[System] [Example1] [Example2] ... [Example20] [User query]
 ↑ all cached                                   ↑ varies
Optimization Tips
1. Order your prompt strategically: Put stable content (system prompt, examples, documents) first. Put variable content (user query, conversation history) last. Caching only works on prefixes.
2. Keep the variable suffix short: The shorter your uncached suffix relative to your cached prefix, the better your savings ratio.
3. Batch similar requests together: If 100 users are querying the same document, their requests all share the cached prefix. Time-clustering these requests maximizes cache hit rates.
4. Monitor cache hit rates:
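# Anthropic: input_tokens excludes cache reads and cache writes,
# so add all three fields to get the total input size.
usage = response.usage
cache_read = usage.cache_read_input_tokens or 0
cache_write = usage.cache_creation_input_tokens or 0
total_input = usage.input_tokens + cache_read + cache_write
cache_hit_rate = cache_read / total_input if total_input else 0.0

# OpenAI: prompt_tokens already includes the cached tokens.
cached = response.usage.prompt_tokens_details.cached_tokens
total = response.usage.prompt_tokens
cache_hit_rate = cached / total if total else 0.0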
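5. Warm the cache proactively: Anthropic's ephemeral cache expires after about 5 minutes without a hit, so before user traffic arrives, send a request that contains the full shared prefix with a small max_tokens value. This populates the cache while generating almost no output.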
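The ordering advice in tip 1 can be sketched as a small message builder. This is an illustration only: the helper names are hypothetical, and the dictionaries follow the chat-style role/content shape used in the OpenAI example earlier. Stable content goes first, in a fixed order, and the per-request query goes last.

```python
from typing import Dict, List, Sequence, Tuple

Message = Dict[str, str]


def build_messages(system_prompt: str,
                   examples: Sequence[Tuple[str, str]],
                   user_query: str) -> List[Message]:
    """Stable content first (system prompt, few-shot examples),
    variable content (the user query) last."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_query})
    return messages


def shared_message_prefix(a: Sequence[Message], b: Sequence[Message]) -> int:
    """Number of leading messages that are identical in both requests."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


examples = [("2+2?", "4"), ("3+3?", "6")]
first = build_messages("You are a calculator.", examples, "5+5?")
second = build_messages("You are a calculator.", examples, "7+7?")
print(shared_message_prefix(first, second), "of", len(first), "messages are shared")
```

Running a check like `shared_message_prefix` over consecutive requests in a test suite is one way to catch accidental prefix changes, such as a reordered example list, before they reduce the cache hit rate.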
When Caching Doesn't Help
- Unique prompts: If every request is completely different, there's no shared prefix to cache
- Short prompts: Below the minimum token threshold, caching doesn't activate
- Low request frequency: If requests are too spread out, the cache expires between them
- Highly dynamic system prompts: If your system prompt changes frequently, the cache invalidates
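These conditions can be turned into a quick pre-flight check. The sketch below is a rough heuristic, not a provider rule: the function name is hypothetical, and the minimum token count and TTL are inputs you supply from your provider's documentation.

```python
from typing import List


def caching_blockers(prefix_tokens: int,
                     min_cacheable_tokens: int,
                     seconds_between_requests: float,
                     ttl_seconds: float,
                     prefix_is_stable: bool) -> List[str]:
    """Return the reasons caching is unlikely to help; empty list means none found."""
    blockers = []
    if not prefix_is_stable:
        blockers.append("prefix changes between requests")
    if prefix_tokens < min_cacheable_tokens:
        blockers.append("prefix is below the minimum cacheable length")
    if seconds_between_requests >= ttl_seconds:
        blockers.append("requests arrive less often than the cache TTL")
    return blockers


# Example: a 500-token prefix queried every 10 minutes against a 5-minute TTL
for reason in caching_blockers(500, 1_024, 600, 300, prefix_is_stable=True):
    print("Caching unlikely to help:", reason)
```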
Cost Modeling
Before optimizing, measure your actual caching opportunity:
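from itertools import takewhile

def common_prefix_length(a, b):
    """Count the leading items (tokens or characters) that a and b share."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))

# Analyze your request log. load_request_log() is a placeholder for your own
# loader; it should return at least two prompts, as token lists or strings,
# in arrival order.
requests = load_request_log()

# Compare each request with the one before it
shared_prefix_lengths = [
    common_prefix_length(prev, curr) for prev, curr in zip(requests, requests[1:])
]

# Average over requests 2..N only, since the first request has no predecessor
avg_cacheable = sum(shared_prefix_lengths) / len(shared_prefix_lengths)
avg_total = sum(len(r) for r in requests[1:]) / len(requests[1:])
cache_opportunity = avg_cacheable / avg_total
print(f"~{cache_opportunity:.0%} of tokens are cacheable")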
If the opportunity is >50%, prompt caching should be a priority optimization.