Context Window Cost - When Long-Context Doubles Your Bill
Meet Hannah Park. Senior Engineer at a doc-analysis startup. "Gemini 1M context lets us pass entire codebases. Should we, or is RAG cheaper?"
🔥 Her first long-context experiment cost $80 for one task. That math could not work in production.
Long-context windows are a UX leap and a cost trap. Gemini 3 Pro: 2M tokens. Claude Sonnet 4.6: 1M. GPT-5: 200K (cached cheap). The temptation: 'just stuff the whole codebase / corpus / docset into the prompt.' The math: that's $5-50 per query depending on model and length.
Hannah's experiment: 800K tokens of code in context, 5K-token analysis output. On Gemini 3 Pro: $1.20 input + $0.10 output ≈ $1.30/query. Sounds fine - until 100 queries/day = $3,900/mo. On Sonnet 4.6 at $3/1M input: $2.40 input + $0.075 output ≈ $2.48 for the first query (un-cached). With caching, the first query pays roughly full price to write the cache (about $2.40), and each repeat reads it at 10% of the input rate (about $0.24). Massively cheaper IF queries hit the same cache window.
Three regimes for long-context decisions. (1) Single-shot (analyze this 500-page document once): long-context wins on simplicity. (2) Repeated queries on same context (Q&A over same codebase): caching dominates economics. (3) Diverse queries on different contexts: RAG with retrieval beats long-context.
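The three regimes can be sketched as a small decision helper. This is a simplification under stated assumptions, not the calculator's logic: the function name `pick_regime` and its two inputs (how many queries reuse one context, and whether answers need the whole context at once) are hypothetical.

```python
def pick_regime(queries_per_context: int, needs_whole_context: bool) -> str:
    """Map a workload onto one of the three regimes described above.

    queries_per_context: how many queries reuse the same context (assumed input).
    needs_whole_context: True when answers depend on cross-document reasoning.
    """
    if not needs_whole_context:
        # Diverse questions that each need only a small slice of the corpus.
        return "rag"
    if queries_per_context > 1:
        # Same large context queried repeatedly: caching dominates the economics.
        return "long-context + caching"
    # One pass over one large document: simplicity wins.
    return "long-context (single-shot)"


print(pick_regime(1, True))     # analyze a 500-page document once
print(pick_regime(200, True))   # Q&A over the same codebase
print(pick_regime(1, False))    # diverse queries across a corpus
```

In practice the boundary between regimes is fuzzier than two inputs suggest, so treat the output as a starting point for the cost math rather than a verdict.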
1M-token context windows enable new use cases - and double your bill. Find the threshold where chunking + RAG beats long-context, and where it doesn't.
Naive long-context cost is huge. 800K input × $3/1M = $2.40 per query. Without caching, $7.2K/mo at 100/day. Most teams can't afford this.
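With caching, the math changes dramatically. Cache write (first query): full price. Cache reads (repeats): 10% of normal, about $0.24/query. When nearly every query hits the cache, effective input cost drops to ~$0.30/query, for a total of ~$900/mo. At a 70% hit rate, the blend of full-price misses and discounted hits is closer to $0.89/query (0.3 × $2.40 + 0.7 × $0.24), so the hit rate drives the savings.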
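RAG comparison. Same workload via RAG: 8K retrieved tokens per query × $3/1M × 100/day = $2.40/day, or about $72/mo. That is roughly 12× cheaper than the ~$900/mo cached long-context figure and 100× cheaper than un-cached long-context. Quality may differ - long-context can find connections RAG misses.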
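To make the comparison reproducible, here is a minimal sketch that recomputes monthly input cost for the three approaches from the stated rates. It assumes a 30-day month, counts input tokens only, and treats a cache miss as a plain full-price request with no separate write surcharge. The function names are illustrative, not part of any calculator.

```python
PRICE_PER_MTOK = 3.00    # USD per 1M input tokens (the $3/1M rate used above)
QUERIES_PER_DAY = 100
DAYS_PER_MONTH = 30      # assumption: 30-day month


def per_query_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Input cost in USD for one un-cached query."""
    return tokens / 1_000_000 * price_per_mtok


def cached_per_query_cost(tokens: int, hit_rate: float,
                          read_fraction: float = 0.10) -> float:
    """Blended input cost: misses pay full price, hits pay read_fraction of it."""
    full = per_query_cost(tokens)
    return (1 - hit_rate) * full + hit_rate * full * read_fraction


def monthly(cost_per_query: float) -> float:
    return cost_per_query * QUERIES_PER_DAY * DAYS_PER_MONTH


naive = monthly(per_query_cost(800_000))                  # about $7,200/mo
cached = monthly(cached_per_query_cost(800_000, 0.70))    # about $2,664/mo
rag = monthly(per_query_cost(8_000))                      # about $72/mo

print(f"naive long-context: ${naive:,.0f}/mo")
print(f"cached (70% hits):  ${cached:,.0f}/mo")
print(f"RAG (8K tokens):    ${rag:,.0f}/mo")
```

Under these assumptions a 70% hit rate lands near $0.89/query. Getting down to roughly $0.30/query requires a hit rate close to 100%, which is worth checking against real traffic before budgeting on the lower figure.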
The real question: do you NEED full context? If yes (cross-document reasoning, code architecture), pay for it. If no (specific answer to specific question), use RAG. Most workloads don't need full context - they think they do.
Same calculator, three different workloads. Here is how the numbers shift.
50 different long PDFs analyzed daily, each one once. No cache benefit. ~$900/mo on Sonnet. Long-context is the right call here - RAG would lose too much cross-document context.
Healthy range: $700-1,200/mo at 50 unique docs/day
Hannah's code analyzer with caching. 70% hit rate on cached codebase context. Cache write cost amortizes. Effective ~$900/mo. Without caching: $5K+/mo. Caching is mandatory.
Healthy range: $700-1,200/mo with cache
1K diverse queries/day across a corpus. RAG retrieves ~8K tokens per query. Effectively replaces long-context. Saves 90%+ vs stuffing full corpus. Only works if RAG retrieval quality is good enough.
Healthy range: $200-400/mo with RAG retrieval
Cost isn't the only dimension. Each constraint below changes the recommendation.
Long-context cost compounds catastrophically fast. Work out per-query cost × volume BEFORE shipping. Caching makes it tractable. RAG often replaces it entirely.
Models have weaker recall for tokens sitting 50%-90% of the way into the context window - the research-documented 'lost in the middle' effect. Long-context isn't a free win for accuracy.
Stuffing a whole codebase or corpus into context means a bigger blast radius if the vendor has an incident. RAG with chunk-level retrieval limits exposure.
Cached prompts live in the vendor's infrastructure for the duration of the TTL. For sensitive data, verify encryption-at-rest and TTL behavior.
Long-context responses are slow - the model has to process all of the input first. RAG is faster but works with less context. Trade-off: comprehensiveness vs latency.
Cache control parameters differ per vendor. Multi-vendor abstraction must handle each. LiteLLM supports it; custom code needs explicit handling.
Long-context bills surprise teams when prompts get longer over time (more tools, more docs added). Add token-per-query monitoring to detect drift before the bill arrives.
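A token-per-query monitor can be as simple as comparing the latest daily average against a baseline. The sketch below is one possible shape, not a prescribed implementation: the function name, the 7-day baseline, and the 25% threshold are all assumptions to tune for your own traffic.

```python
from statistics import mean


def detect_token_drift(daily_avg_tokens: list[float],
                       baseline_days: int = 7,
                       threshold: float = 1.25) -> bool:
    """Return True when the latest day's average tokens per query exceeds
    the baseline mean (first `baseline_days` entries) by `threshold` or more.

    daily_avg_tokens: one average-input-tokens-per-query value per day, oldest first.
    """
    if len(daily_avg_tokens) <= baseline_days:
        return False  # not enough history to compare against a baseline
    baseline = mean(daily_avg_tokens[:baseline_days])
    latest = daily_avg_tokens[-1]
    return latest >= baseline * threshold


history = [800_000] * 7 + [820_000, 900_000, 1_050_000]
if detect_token_drift(history):
    print("Prompt size drifted above baseline; re-run the cost math.")
```

A fixed baseline window is used here so slow growth cannot hide inside a moving average. A rolling baseline is a reasonable alternative if prompts are expected to change often.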
Scenarios for the most common applications, with realistic numbers for each.
Legal review needs to cross-reference between sections. Long-context essential. Premium tier (Opus) for accuracy. ~$2K/mo at 20 contracts/day. Worth it - paralegal time replaced.
Healthy range: $1.5-2.5K/mo (premium tier justified)
Synthesize 5-10 papers into one analysis. Cached context for repeated queries on same set. ~$1.5K/mo. RAG could work but loses cross-paper connections.
Healthy range: $1-2K/mo with caching
User asks multiple questions about same long video. 85% cache hit (same transcript). Cheap tier fine for Q&A. ~$500/mo at 500 queries/day.
Healthy range: $300-700/mo with high cache rate
Internal Q&A over 10K-doc knowledge base. RAG retrieves 4-6 chunks per query. Long-context would be $50K+/mo. RAG is $400/mo. No contest.
Healthy range: $200-500/mo with RAG
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Prompt Cache ROI for caching detail. RAG Pipeline for the alternative.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic - Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate - Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic - Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic - Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input - Anthropic documents uniform 50% Batch discount. |
| Anthropic - Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output - Anthropic documents uniform 50% Batch discount. |
| Anthropic - Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate - Anthropic 90% cache-hit discount convention. |
| OpenAI - GPT-5.4 Mini | cachedInput | Derived at 10% of input - OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI - GPT-5.4 Nano | cachedInput | Derived at 10% of input - OpenAI 90% cache-hit convention. |
| OpenAI - GPT-5.4 Nano | batchInput | Derived at 50% of input - OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Nano | batchOutput | Derived at 50% of output - OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Pro | cachedInput | Derived at 10% of input - OpenAI 90% cache-hit convention. |
| OpenAI - GPT-5.4 Pro | batchInput | Derived at 50% of input - OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Pro | batchOutput | Derived at 50% of output - OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI - GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI - GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.5 Pro | cachedInput | Derived at 10% of input - OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI - GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.2 Pro | cachedInput | Derived at 10% of input - pro-tier convention. |
| OpenAI - GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI - GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google - Gemini 3 Flash | cachedInput | Derived at 10% of input - Google caching discount convention ~90%. |
| Google - Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input - Google caching convention. |
| Google - Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input - Google Batch API uniform 50% discount. |
| Google - Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output - Google Batch API uniform 50% discount. |
| Google - Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google - Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google - Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input - Google caching convention. |
| Google - Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input - Google Batch API uniform 50% discount. |
| Google - Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output - Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google - Gemini 2.0 Flash | batchInput | Derived at 50% of input - Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash | batchOutput | Derived at 50% of output - Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input - Google caching convention. |
| Google - Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input - Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output - Google Batch API uniform 50% discount. |
| xAI - Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
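The derivation conventions in the table can be sketched as a single helper. This is an illustration under stated assumptions, not the site's actual pipeline: `derive_rates` is a hypothetical name, the field names mirror the table's `cachedInput`, `batchInput`, and `batchOutput`, and the $3 / $15 per 1M rates are example inputs only.

```python
def derive_rates(input_rate: float, output_rate: float,
                 cache_fraction: float = 0.10,
                 batch_fraction: float = 0.50) -> dict[str, float]:
    """Infer unpublished rates from base rates (USD per 1M tokens).

    cache_fraction: cached input as a share of base input (0.10 = 90% off).
    batch_fraction: Batch API price as a share of base (0.50 = 50% off).
    """
    return {
        "cachedInput": round(input_rate * cache_fraction, 6),
        "batchInput": round(input_rate * batch_fraction, 6),
        "batchOutput": round(output_rate * batch_fraction, 6),
    }


# Default conventions: 10% cached input, 50% batch.
print(derive_rates(3.00, 15.00))

# Rows that use a 25% cached-input convention pass it explicitly.
print(derive_rates(3.00, 15.00, cache_fraction=0.25))
```

Keeping the fractions as parameters makes the exceptions in the table (the 25% rows) explicit at the call site instead of buried in special cases.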
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for a full audit trail.