Guides → Playground & Guide → Prompt Cache ROI - Cache or Not? (with Real Hit-Rate Math)
Meet Hiroshi Tanaka. Backend Engineer at a 30-person SaaS. "Anthropic offers prompt caching. My RAG bot has long system prompts. Worth setting up?"
🔥 $3K/mo Anthropic bill - 80% of it is input tokens that look cacheable.
Prompt caching is one of the highest-leverage optimizations available - and most teams skip it. Anthropic charges 10% of normal input price for cached tokens (with a 5-min TTL on the cache). Google Gemini caches at similar economics. For input-heavy workloads, this is a 30-50% cost cut.
Hiroshi's RAG bot has a 5K-token system prompt + 8K-token retrieved context per query. Most of that input repeats across queries - system prompt always identical, retrieved chunks often overlap (FAQs hit the same docs). At 60% effective cache hit rate, his $3K bill drops to ~$1,950.
The decision is: do your queries share enough repeated context for caching to fire? System prompts always cache (identical every time). Repeated tool definitions cache. Few-shot examples cache. RAG retrievals partially cache (overlapping chunks). User-specific context never caches.
This calc helps you estimate cache hit rate per workload type and surface whether the savings clear the setup overhead.
Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.
What you'll see after the calculator runs. Each card explains how to read the number.
Anthropic charges 10% for cached input tokens. Find the cache-hit rate that makes setup worthwhile - and the workloads where caching saves 30-50%.
cache
Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.
Each input shapes your cost. Move the slider — see the impact.
Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.
🚀 Open the full calculator →Read the savings: input × hit rate × discount. Hiroshi: $2,400 × 0.5 × 0.9 = $1,080/mo savings. From a $3K total bill, that's a 36% reduction.
Setup is small but real. Add cache-control parameters to your prompt structure (4-8 hours engineering for a single-agent app). Verify hit rate in production for 1-2 weeks before counting savings.
Cache TTL matters. Anthropic cache: 5-minute TTL by default, 1-hour with extended option. If your traffic is bursty (queries every 30 min), expect lower-than-expected hit rate. Continuous traffic patterns get the full benefit.
Watch for cache-invalidation bugs. A subtle change to system prompt = cache miss for hours. Static prompts only - version your prompts, don't string-format runtime values into the cached portion.
Same calculator, three different team sizes. Click a tab to see how the numbers shift.
$400 input spend at 25% hit rate saves $90/mo. Setup cost ~$1,200 (8hr × $150 loaded eng). Payback 13 months. Skip - invest the engineering hours elsewhere.
Healthy range: Savings ~$90/mo - barely worth setup
Hiroshi's RAG bot. $1,080/mo savings against $1,200 setup. Payback ~5 weeks. Solid investment. Plus future calls just get the discount automatically.
Healthy range: Savings $1,080/mo - payback in days
Tool-using agent with 12 function definitions repeated every call. 75% hit rate easy. $5.4K/mo savings on $8K input spend. Setup ~$2,400. Payback under 2 weeks. Mandatory.
Healthy range: Savings $5,400/mo - major win
Cost isn't the only dimension. Click any constraint — see how recommendations change.
Caching is a Pareto-style optimization - 80% of the savings come from 20% of the prompt structure (system + tools + examples). Cache that first; don't try to cache user input.
Cached vs uncached input goes through the same model. Cache is a billing optimization, not a quality compromise. Outputs are identical.
Cached content lives within the same compliance boundary as your API tier. No additional governance burden.
Cache static prompt portions only. Never cache user-specific data - it'd cross queries and create privacy issues. Standard practice: cache up to user message; don't cache user message itself.
Bonus: cache hits are slightly faster than cache misses. Marginal but consistent.
Anthropic's prompt caching API isn't identical to Google's context caching. Multi-vendor abstraction needs to handle this. LiteLLM has it; custom abstractions need explicit handling.
Cache hit rate is now a metric you care about. Drop in hit rate (e.g., from 60% to 30%) usually means a prompt got reformatted. Add to observability.
Tradeoff analysis is where most AI projects go sideways. Talk to a CFO-grade AI cost analyst →
Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.
5K-token system prompt + 4-doc retrievals. System prompt 100% cacheable. Retrievals ~30-50% overlap (FAQs repeat). Combined ~60% hit rate. Savings $1,890/mo.
Healthy range: Strong cache fit - system prompt + repeated chunks
Agent with 15 tool definitions = 8K-12K tokens of static input every turn. Cache hit rate 80%+. Savings $3,600/mo on $5K input.
Healthy range: Excellent fit - tool defs are huge + always identical
Classification with 20 few-shot examples = ~6K tokens of static prompt every query. Examples never change. Hit rate 85%. Savings $918/mo on $1.2K. Setup minimal.
Healthy range: Excellent fit - examples are static
Code review bot - every PR has different code. 20% cache hit (system prompt only, no repeated user content). Savings $360/mo. Setup $1,200. Payback 3-4 months. Marginal - invest if you have spare eng cycles.
Healthy range: Low fit - user-driven content dominates input
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Cost Calculator for full bill breakdown. Token Reduction Analyzer for further optimization.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
View 3-year history for →
Last-verified date is the most recent successful daily snapshot
(aicost_pricing_snapshots) or, when no snapshot exists yet,
the latest successful crawler run (aicost_crawler_runs).
10 of 10
vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.)
are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput |
Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput |
Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput |
Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput |
Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput |
Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput |
Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput |
Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput |
Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput |
Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput |
Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput |
Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput |
Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput |
Extrapolated at 25% of base. |
Pricing is cross-verified against the
LiteLLM community registry
when available. Daily snapshots are kept in aicost_pricing_snapshots;
every change is logged to aicost_price_changelog with old & new
values for full audit trail. Read the full methodology →