Guides → Playground & Guide → Prompt Cache ROI - Cache or Not? (with Real Hit-Rate Math)

Prompt Cache ROI - Cache or Not? (with Real Hit-Rate Math)

Meet Hiroshi Tanaka. Backend Engineer at a 30-person SaaS. "Anthropic offers prompt caching. My RAG bot has long system prompts. Worth setting up?"

🔥 $3K/mo Anthropic bill - 80% of it is input tokens that look cacheable.

The story

Prompt caching is one of the highest-leverage optimizations available - and most teams skip it. Anthropic charges 10% of normal input price for cached tokens (with a 5-min TTL on the cache). Google Gemini caches at similar economics. For input-heavy workloads, this is a 30-50% cost cut.

Hiroshi's RAG bot has a 5K-token system prompt + 8K-token retrieved context per query. Most of that input repeats across queries - system prompt always identical, retrieved chunks often overlap (FAQs hit the same docs). At 60% effective cache hit rate, his $3K bill drops to ~$1,950.

The decision is: do your queries share enough repeated context for caching to fire? System prompts always cache (identical every time). Repeated tool definitions cache. Few-shot examples cache. RAG retrievals partially cache (overlapping chunks). User-specific context never caches.

This calc helps you estimate cache hit rate per workload type and surface whether the savings clear the setup overhead.

📊 CALCULATOR AT A GLANCE

🚀 Open the full calculator ✉️ Email [email protected]

🎛 Inputs you control

Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.

▸ Model — The cache-capable model running your workload. Cache economics differ sharply by provider, so this choice decides whether caching is free upside or a gamble.

How to choose: Anthropic (Claude Opus 4.8 / Sonnet 4.6 / Haiku 4.5): 90% read discount but a 25% write premium → you must clear a break-even hit rate to save. OpenAI (GPT-5.5) and Google (Gemini 3.1 Pro / Flash): no write premium → any hit rate above 0% saves. Match the model to the work, then read the break-even bar.

▸ Requests per month — Total API requests per month. Scales the absolute dollar impact — the per-request economics are the same, but volume decides whether the savings are worth the setup hours.

How to choose: Pull the real number from your provider dashboard, or estimate requests/day × 30. Below ~50K/mo, even a great hit rate may not clear the engineering cost of wiring up cache-control.

▸ Input tokens / request — Average input tokens per request. Caching only discounts INPUT tokens, so this is the lever that sizes your savings — bigger inputs, bigger absolute win.

How to choose: Typical ranges: RAG bots 4K-12K (system prompt + retrieved docs), tool-using agents 8K-12K (large static tool definitions), one-off chat under 1K. Use your real average, not a peak.

▸ Output tokens / request — Average output tokens per request. Output is never cacheable, so it is pure uncachable cost — heavy-output workloads see a smaller percentage savings even when input caches perfectly.

How to choose: Set to your real average completion length. If output dwarfs input (e.g. long generation from a short prompt), caching helps less; if input dwarfs output (RAG, classification), caching helps most.

▸ Cacheable portion of input — The fraction of your input that is a stable, reusable prefix — system prompt, tool definitions, few-shot examples, repeated docs. Only this portion can ever hit cache.

How to choose: System prompts, tool defs and few-shot examples are 100% cacheable; user-specific content is 0%. Rough fits: RAG ~60%, long system prompts ~90%, tool agents ~85%, one-off chat ~20%. Put static content first in the prompt — caches match the longest common prefix.

▸ Cache hit rate — The fraction of requests that arrive while the cached prefix is still warm. This is the number that decides win-or-lose on write-premium providers like Anthropic.

How to choose: Driven by cache TTL vs the gap between calls (Anthropic 5-min default / 1-hour extended; Gemini configurable). Real-world: tool-using agents 70-90%, RAG/agent workloads 40-70%, chatty/bursty traffic 20-50%. Estimate conservatively, then measure in production for 1-2 weeks.

📊 Outputs computed for you

What you'll see after the calculator runs. Each card explains how to read the number.

▸ No-cache cost — What this workload costs per month with caching disabled — your baseline.

How to read: This is the number caching has to beat. Compare it against the with-cache card to see the dollar gap.

▸ With-cache cost — Estimated monthly cost with caching on, blended across your hit and miss rates.

How to read: If this is below the no-cache card, caching wins. On Anthropic it can sit ABOVE the baseline when your hit rate is under break-even — that is the write premium biting.

▸ Monthly savings — No-cache cost minus with-cache cost — the headline dollar result, monthly and annualized.

How to read: Positive (green) = caching pays. Negative (red) = caching is costing you money at this hit rate; raise the hit rate or turn cache off for this model.

▸ Break-even hit rate — The minimum cache hit rate at which caching stops losing money on the selected model.

How to read: On no-write-premium providers (OpenAI, Google) this is effectively 0% — caching always saves. On Anthropic it is typically ~22%; your YOU marker must sit to the right of it to be in the green.

About this calculator: Prompt Cache ROI - Cache or Not? (with Real Hit-Rate Math)

Anthropic charges 10% for cached input tokens. Find the cache-hit rate that makes setup worthwhile - and the workloads where caching saves 30-50%.

Inputs you control

Input	Impact on result	Range	Typical
Monthly input-token spend ($)	From your bill - input tokens specifically (output not cached). Roughly: total bill × input share. RAG bots: 70-90% input. Chat bots: 50-70% input.	50 – 50K	2400
Cache hit rate (0-1)	Fraction of input tokens that hit cache. Conservative estimate. Most teams achieve 40-70% on RAG/agent workloads, 70-90% on tool-using agents with repeated function defs.	0 – 0.95	0.5
Cached token discount (% off)	Anthropic charges 10% for cached input = 90% off. Google Gemini context caching: similar. Set to 90 unless your vendor is different.	50 – 95	90

Outputs computed for you · model: `cache`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.

What you're looking at

Each input shapes your cost. Move the slider — see the impact.

Monthly input-token spend ($) 2,400

From your bill - input tokens specifically (output not cached). Roughly: total bill × input share. RAG bots: 70-90% input. Chat bots: 50-70% input.

Estimated: —

Cache hit rate (0-1) 0.5

Fraction of input tokens that hit cache. Conservative estimate. Most teams achieve 40-70% on RAG/agent workloads, 70-90% on tool-using agents with repeated function defs.

Estimated: —

Cached token discount (% off) 90

Anthropic charges 10% for cached input = 90% off. Google Gemini context caching: similar. Set to 90 unless your vendor is different.

Estimated: —

Ready to run the numbers?

Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.

🚀 Open the full calculator →

Reading your result

Read the savings: input × hit rate × discount. Hiroshi: $2,400 × 0.5 × 0.9 = $1,080/mo savings. From a $3K total bill, that's a 36% reduction.

Setup is small but real. Add cache-control parameters to your prompt structure (4-8 hours engineering for a single-agent app). Verify hit rate in production for 1-2 weeks before counting savings.

Cache TTL matters. Anthropic cache: 5-minute TTL by default, 1-hour with extended option. If your traffic is bursty (queries every 30 min), expect lower-than-expected hit rate. Continuous traffic patterns get the full benefit.

Watch for cache-invalidation bugs. A subtle change to system prompt = cache miss for hours. Static prompts only - version your prompts, don't string-format runtime values into the cached portion.

What "good" looks like:

Strong fit: RAG, agents with tools, repeated few-shot - 50-80% hit rate likely
Moderate fit: Chatbots with system prompt - 30-50%
Limited fit: Highly varied user inputs, no repeated context - <20%
Skip caching: <$500/mo input spend OR <20% hit rate (savings not worth setup)

Vendors with prompt caching support

Verified 20 hours ago

1

GPT-5 Mini

$0.250 in · $2.00 out ·
2

Command

$1.00 in · $2.00 out ·
3

devstral-2

$0.400 in · $2.00 out ·

Three real scenarios

Same calculator, three different team sizes. Click a tab to see how the numbers shift.

$310.00 / month ≈ $3,720 / year

$400 input spend at 25% hit rate saves $90/mo. Setup cost ~$1,200 (8hr × $150 loaded eng). Payback 13 months. Skip - invest the engineering hours elsewhere.

Healthy range: Savings ~$90/mo - barely worth setup

See inputs used

monthlyInputSpendUsd: 400
estimatedCacheHitRate: 0.25
cachedTokenDiscountPct: 90
setupHoursEng: 8

Trade-offs

Cost isn't the only dimension. Click any constraint — see how recommendations change.

What matters most to you? Click any dimension — recommendations update.

Best fit for "cost":

Cache the system prompt (always) Highest leverage, lowest effort
Cache tool definitions (agents) Massive savings on agent workloads
Cache few-shot examples High hit rate, simple to implement

Caching is a Pareto-style optimization - 80% of the savings come from 20% of the prompt structure (system + tools + examples). Cache that first; don't try to cache user input.

Use cases

Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.

$1,610 / month ≈ $19,320 / year

5K-token system prompt + 4-doc retrievals. System prompt 100% cacheable. Retrievals ~30-50% overlap (FAQs repeat). Combined ~60% hit rate. Savings $1,890/mo.

Healthy range: Strong cache fit - system prompt + repeated chunks

See inputs used

monthlyInputSpendUsd: 3,500
estimatedCacheHitRate: 0.6
cachedTokenDiscountPct: 90
setupHoursEng: 6

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Cache hit rate is estimated - measure actual rate after deployment.
TTL constraints (5-min default, 1-hour extended) reduce effective hit rate for low-traffic windows.
Doesn't model the cache-population miss (first call after TTL expires costs full price).
Cached token discount varies by vendor (Anthropic 90%, Google ~75%, OpenAI doesn't have direct equivalent).
Some workloads have hidden static portions that aren't being cached because of prompt formatting.

For these, use: Cost Calculator for full bill breakdown. Token Reduction Analyzer for further optimization.

Where to go next

Cut tokens further (compression, distillation) →

After caching, look for token-level reductions.

Stack savings with batch processing →

Cached + batch = 70%+ off list.

Route to cheap models for simple queries →

Cache + cheap model for easy stuff = best per-query economics.

Methodology

Source: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
Extraction: Cache pricing from Anthropic, Google docs. Hit-rate benchmarks from 8 production migrations (anonymized).
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

View 3-year history for →

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Anthropic

2026-06-05

https://www.anthropic.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Anthropic Docs

2026-06-05

https://platform.claude.com/docs/en/about-claude/pricing

Daily snapshot since Sep 2023 · 578 days captured

OpenAI

2026-06-05

https://openai.com/api/pricing/

Daily snapshot since Sep 2023 · 579 days captured

Google AI

2026-06-05

https://ai.google.dev/gemini-api/docs/pricing

Daily snapshot since Dec 2023 · 554 days captured

Google Vertex

2026-06-05

https://cloud.google.com/vertex-ai/generative-ai/pricing

Daily snapshot since Dec 2023 · 554 days captured

DeepSeek

2026-06-05

https://api-docs.deepseek.com/quick_start/pricing

Daily snapshot since May 2024 · 493 days captured

xAI

2026-06-05

https://x.ai/api

Daily snapshot since Nov 2024 · 411 days captured

Mistral

2026-06-05

https://mistral.ai/pricing

Daily snapshot since Dec 2023 · 552 days captured

Cohere

2026-06-05

https://cohere.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Voyage AI

2026-06-05

https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →

Prompt Cache ROI - Cache or Not? (with Real Hit-Rate Math)

The story

🎛 Inputs you control

📊 Outputs computed for you

About this calculator: Prompt Cache ROI - Cache or Not? (with Real Hit-Rate Math)

Inputs you control

Outputs computed for you · model: `cache`

What you're looking at

Ready to run the numbers?

Reading your result

Vendors with prompt caching support

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

The story

🎛 Inputs you control

📊 Outputs computed for you

About this calculator: Prompt Cache ROI - Cache or Not? (with Real Hit-Rate Math)

Inputs you control

Outputs computed for you · model: cache

What you're looking at

Ready to run the numbers?

Reading your result

Vendors with prompt caching support

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Outputs computed for you · model: `cache`