Context Window Cost - When Long-Context Doubles Your Bill

Meet Hannah Park. Senior Engineer at a doc-analysis startup. "Gemini 1M context lets us pass entire codebases. Should we, or is RAG cheaper?"

🔥 Her first long-context experiment cost $80 for one task. That math could never work in production.

The story

Long-context windows are a UX leap and a cost trap. Gemini 3 Pro: 2M tokens. Claude Sonnet 4.6: 1M. GPT-5: 200K (cheap when cached). The temptation: 'just stuff the whole codebase / corpus / docset into the prompt.' The math: that can run $5-50 per query depending on model and length.

Hannah's experiment: 800K tokens of code in context, 5K-token analysis output. On Gemini 3 Pro: $1.20 input + $0.10 output ≈ $1.30/query. Sounds fine - until 100 queries/day = $3,900/mo. On Sonnet 4.6, at $3/1M input and $15/1M output: $2.40 input + $0.075 output ≈ $2.48 for the first query (un-cached). With caching: about $2.40 for the cache write on the first query, then about $0.24 per cached read on each repeat, since cached reads bill at 10% of the normal input rate. Massively cheaper IF queries hit the same cache window.
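
The per-query arithmetic for Hannah's experiment can be reproduced in a few lines. This is a minimal sketch rather than the calculator's own code: the function name `per_query_cost` is hypothetical, and the Sonnet rates ($3 per 1M input tokens, $15 per 1M output tokens) are assumptions taken from the figures on this page, so check current vendor pricing before relying on them.

```python
def per_query_cost(input_tokens: int, output_tokens: int,
                   input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Cost in USD of one un-cached request, given per-1M-token rates."""
    input_cost = input_tokens / 1_000_000 * input_usd_per_m
    output_cost = output_tokens / 1_000_000 * output_usd_per_m
    return input_cost + output_cost


# Hannah's experiment: 800K tokens in, 5K tokens out.
# Rates below are assumptions from this page, not live vendor prices.
sonnet_query = per_query_cost(800_000, 5_000, 3.00, 15.00)
print(f"${sonnet_query:.3f} per query")                           # $2.475 per query
print(f"${sonnet_query * 100 * 30:,.0f} per month at 100/day")    # $7,425 per month at 100/day
```

Keeping the rates as arguments makes it easy to rerun the same token shape against another model's price sheet.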

Three regimes for long-context decisions. (1) Single-shot (analyze this 500-page document once): long-context wins on simplicity. (2) Repeated queries on same context (Q&A over same codebase): caching dominates economics. (3) Diverse queries on different contexts: RAG with retrieval beats long-context.
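
The three regimes can be written down as a small decision helper. This is one possible reading of the rule of thumb above, not a definitive policy: the names `Strategy` and `pick_strategy` and both parameters are hypothetical, and real workloads usually need an eval pass before committing to a regime.

```python
from enum import Enum


class Strategy(Enum):
    LONG_CONTEXT = "long-context, single shot"
    CACHED_LONG_CONTEXT = "long-context with prompt caching"
    RAG = "RAG with retrieval"


def pick_strategy(queries_per_context: int, needs_whole_context: bool) -> Strategy:
    """Map a workload onto one of the three regimes.

    queries_per_context: how many queries reuse the same context.
    needs_whole_context: True if answers depend on the full document set
    (cross-document reasoning), False if a few retrieved chunks suffice.
    """
    if not needs_whole_context:
        return Strategy.RAG                    # regime 3: diverse, chunk-sized lookups
    if queries_per_context > 1:
        return Strategy.CACHED_LONG_CONTEXT    # regime 2: same context, many questions
    return Strategy.LONG_CONTEXT               # regime 1: analyze once


print(pick_strategy(1, True).value)     # long-context, single shot
print(pick_strategy(40, True).value)    # long-context with prompt caching
print(pick_strategy(40, False).value)   # RAG with retrieval
```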

📊 CALCULATOR AT A GLANCE

🎛 Inputs you control

Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.

▸ Input tokens per request — Total input token count per call: system prompt + retrieval + conversation history + user message + tool definitions.

How to choose: Use your typical (not worst-case) size. Run a real rendered example through Token Estimator if you don't have a number. Examples: simple chat 500-2K, RAG with top-5 chunks 5-10K, long-document analysis 50-500K.

▸ Output tokens per request — Tokens the model generates. Output isn't usually subject to context-window premium tiers, but it still costs 3-5× more per token than input.

How to choose: Constrain explicitly in your prompt. Typical: chat reply 150-600, summary 200-800, code generation 500-2000, long-form 1000-4000.

▸ Monthly requests — How many requests per month at this token shape. Multiplies all per-call costs.

How to choose: Use actual telemetry if you have it; otherwise peak users × requests per user per day × 30 + 30% retry buffer.
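
When no telemetry exists, the fallback estimate can be sketched as below. The function name and defaults are hypothetical, and the formula is read here as "(peak users × requests per user per day × 30) plus a 30% retry buffer on top", which is an interpretation of the one-line rule above.

```python
def estimate_monthly_requests(peak_users: int,
                              requests_per_user_per_day: float,
                              days_per_month: int = 30,
                              retry_buffer: float = 0.30) -> int:
    """Rough monthly request count with a retry buffer applied on top."""
    base = peak_users * requests_per_user_per_day * days_per_month
    return round(base * (1 + retry_buffer))


# Example: 200 peak users, 5 requests each per day.
print(estimate_monthly_requests(200, 5))   # 39000
```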

📊 Outputs computed for you

What you'll see after the calculator runs. Each card explains how to read the number.

▸ Sweet-spot model recommendation — The model that's cheapest at your specific context size while still fitting (not over context) and ideally not triggering a premium tier.

How to read: Start here. If the sweet-spot model meets your quality bar on eval, this is your answer. If not, work up the price ranking until you find one that does.

▸ Cost vs context-size chart — Line chart showing per-request cost for each model across context sizes from 1K to 2M. Vertical jumps = premium-tier trigger points.

How to read: Flat lines (Claude Opus within its 128K window, DeepSeek V3.2 within 128K) are tier-stable. Step functions (GPT-5.5 at 272K, Gemini 3.1 Pro at 200K) show where pricing jumps to a premium tier. If your context size sits on a step edge, small prompt growth = big cost jump.

▸ Per-model threshold list — Table of every model showing: max context, premium threshold (if any), base price, premium price.

How to read: For models with thresholds, your goal is to stay below. Models without thresholds (flat pricing) are predictable as context grows.
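
A premium threshold behaves like a step function on input cost. The sketch below assumes that once the input exceeds the threshold, every input token in the request is billed at the premium rate; vendors differ on this, so treat it as an assumption. The function name and the $2 / $4 rates are hypothetical placeholders, not any vendor's prices.

```python
from typing import Optional


def tiered_input_cost(input_tokens: int,
                      base_usd_per_m: float,
                      premium_threshold: Optional[int] = None,
                      premium_usd_per_m: Optional[float] = None) -> float:
    """Input cost in USD; the whole request reprices above the threshold."""
    over = (premium_threshold is not None
            and premium_usd_per_m is not None
            and input_tokens > premium_threshold)
    rate = premium_usd_per_m if over else base_usd_per_m
    return input_tokens / 1_000_000 * rate


# Hypothetical model: $2/1M up to 200K input tokens, $4/1M beyond that.
below = tiered_input_cost(200_000, 2.00, 200_000, 4.00)
above = tiered_input_cost(201_000, 2.00, 200_000, 4.00)
print(f"${below:.3f} at the edge, ${above:.3f} just past it")
# $0.400 at the edge, $0.804 just past it
```

A 0.5% increase in prompt size roughly doubles the input cost here, which is the step-edge effect the chart makes visible.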

▸ Over-context flags — Models whose max context is smaller than your input. These are silently broken — they'll truncate or error.

How to read: If your context size puts a model on the over-context list, eliminate it from consideration. Don't try to "make it work" by truncating: quality falls off a cliff.
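
The over-context check itself is a simple filter. In this sketch the `ModelSpec` type, the `split_by_fit` helper, and the model names are all hypothetical; it also assumes that reserved output tokens count against the same window as the input, which holds for many APIs but should be confirmed per vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    max_context_tokens: int


def split_by_fit(models: list[ModelSpec], input_tokens: int,
                 reserved_output_tokens: int = 0) -> tuple[list[str], list[str]]:
    """Return (fits, over_context) model names for a given request size."""
    needed = input_tokens + reserved_output_tokens
    fits = [m.name for m in models if m.max_context_tokens >= needed]
    over = [m.name for m in models if m.max_context_tokens < needed]
    return fits, over


# Hypothetical catalogue; names and limits are placeholders.
catalogue = [
    ModelSpec("model-128k", 128_000),
    ModelSpec("model-1m", 1_000_000),
]
fits, over = split_by_fit(catalogue, input_tokens=800_000, reserved_output_tokens=5_000)
print("fits:", fits)         # fits: ['model-1m']
print("eliminate:", over)    # eliminate: ['model-128k']
```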

About this calculator: Context Window Cost - When Long-Context Doubles Your Bill

1M-token context windows enable new use cases - and double your bill. Find the threshold where chunking + RAG beats long-context, and where it doesn't.

Inputs you control

Input	Impact on result	Range	Typical
Context tokens per query	How big the context is. Small RAG: 5-10K. Long doc: 50-200K. Whole codebase: 500K-2M.	10K – 2M	800,000
Queries per day	Per-query cost × volume. Long-context costs compound fast.	1 – 100K	100
Cache hit rate (if reused context)	Fraction of queries that hit cached context (same long context, multiple questions). Higher = bigger savings vs no-cache.	0 – 0.95	0.7

Outputs computed for you · model: `token`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Reading your result

Naive long-context cost is huge. 800K input × $3/1M = $2.40 per query. Without caching, $7.2K/mo at 100/day. Most teams can't afford this.

With caching, math changes dramatically. Cache write (first query): full price. Cache reads (repeats): 10% of normal, so $0.24 instead of $2.40. At 70% hit rate, effective input cost is 0.30 × $2.40 + 0.70 × $0.24 ≈ $0.89/query. Total ~$2,660/mo, down from $7.2K.

RAG comparison. Same workload via RAG: 8K retrieved tokens per query × $3/1M × 100/day × 30 days = $72/mo. Roughly 37× cheaper than cached long-context at that hit rate. Quality may differ - long-context can find connections RAG misses.
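
The cached and RAG figures can be checked with a blended-rate calculation. This is a sketch under stated assumptions, not the calculator's model: cache misses bill at the full input rate, cache hits bill at 10% of it, output tokens are ignored, and a month is 30 days. The function names are hypothetical and the $3/1M rate is the one used on this page.

```python
def blended_input_cost(context_tokens: int, usd_per_m: float,
                       cache_hit_rate: float,
                       cached_read_fraction: float = 0.10) -> float:
    """Average input cost per query when a share of queries hit the cache."""
    full = context_tokens / 1_000_000 * usd_per_m
    miss_share = 1 - cache_hit_rate
    return miss_share * full + cache_hit_rate * full * cached_read_fraction


def monthly_cost(per_query: float, queries_per_day: int, days: int = 30) -> float:
    return per_query * queries_per_day * days


long_context = blended_input_cost(800_000, 3.00, cache_hit_rate=0.70)
rag = blended_input_cost(8_000, 3.00, cache_hit_rate=0.0)

print(f"cached long-context: ${monthly_cost(long_context, 100):,.0f}/mo")  # $2,664/mo
print(f"RAG:                 ${monthly_cost(rag, 100):,.0f}/mo")           # $72/mo
print(f"ratio:               {long_context / rag:.0f}x")                   # 37x
```

Raising `cache_hit_rate` toward 0.95 narrows the gap considerably, which is why the hit rate is worth measuring rather than guessing.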

The real question: do you NEED full context? If yes (cross-document reasoning, code architecture), pay for it. If no (specific answer to specific question), use RAG. Most workloads don't need full context - they think they do.

What "good" looks like:

Long-context wins: Cross-document reasoning, code architecture analysis, multi-file refactoring
RAG wins: Specific question over large corpus, fact lookup, top-k relevant chunk retrieval
Hybrid wins: Cached long context for repeated similar questions on same content
Cost-prohibitive: >500K context × >100 queries/day without caching - fix architecture

Models with 200K+ context windows

Verified 20 hours ago

1. GPT-5 Mini: $0.250 in · $2.00 out
2. Command: $1.00 in · $2.00 out
3. devstral-2: $0.400 in · $2.00 out

Three real scenarios

Same calculator, three different team sizes. Click a tab to see how the numbers shift.

$404.97 / month ≈ $4,860 / year

50 different long PDFs analyzed daily, each one once. No cache benefit. ~$900/mo on Sonnet. Long-context is the right call here - RAG would lose too much cross-document context.

Healthy range: $700-1,200/mo at 50 unique docs/day

Inputs used:

contextTokens: 200,000
queriesPerDay: 50
cacheHitRate: 0
outputTokens: 3,000
modelTier: balanced
workingDaysPerMonth: 22

Trade-offs

Cost isn't the only dimension. Click any constraint — see how recommendations change.

Best fit for "cost":

Caching (mandatory above 50K context): 10× cheaper on repeated context
RAG when context > 100K and queries > 1K/day: 30-90% savings
Single-shot premium for one-off analysis: don't engineer RAG for 5 queries/day

Long-context cost compounds catastrophically fast. Work out per-query cost × volume BEFORE shipping. Caching makes it tractable. RAG often replaces it entirely.

Use cases

Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.

$2,024 / month ≈ $24,292 / year

Legal review needs to cross-reference between sections. Long-context essential. Premium tier (Opus) for accuracy. ~$2K/mo at 20 contracts/day. Worth it - paralegal time replaced.

Healthy range: $1.5-2.5K/mo (premium tier justified)

Inputs used:

contextTokens: 300,000
queriesPerDay: 20
cacheHitRate: 0.5
outputTokens: 4,000
modelTier: premium
workingDaysPerMonth: 22

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Cache TTL varies (Anthropic 5min default, 1hr extended; Google differs).
Doesn't model the lost-in-the-middle quality drop quantitatively.
Output token cost compounds with input on long-context queries.
Some vendors price their long-context tier differently (for example, Gemini context caching) - check current pricing.

For these, use: Prompt Cache ROI for caching detail. RAG Pipeline for the alternative.

Where to go next

Cache ROI math →

Cache hit rate × discount = savings.

RAG as alternative →

Full pipeline cost comparison.

Long-context in agent loops →

Context grows turn-by-turn.

Methodology

Source: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
Extraction: Per-vendor context window + caching pricing extracted weekly.
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Vendor	Last verified	URL	Snapshot history
Anthropic	2026-06-05	https://www.anthropic.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs	2026-06-05	https://platform.claude.com/docs/en/about-claude/pricing	Daily snapshot since Sep 2023 · 578 days captured
OpenAI	2026-06-05	https://openai.com/api/pricing/	Daily snapshot since Sep 2023 · 579 days captured
Google AI	2026-06-05	https://ai.google.dev/gemini-api/docs/pricing	Daily snapshot since Dec 2023 · 554 days captured
Google Vertex	2026-06-05	https://cloud.google.com/vertex-ai/generative-ai/pricing	Daily snapshot since Dec 2023 · 554 days captured
DeepSeek	2026-06-05	https://api-docs.deepseek.com/quick_start/pricing	Daily snapshot since May 2024 · 493 days captured
xAI	2026-06-05	https://x.ai/api	Daily snapshot since Nov 2024 · 411 days captured
Mistral	2026-06-05	https://mistral.ai/pricing	Daily snapshot since Dec 2023 · 552 days captured
Cohere	2026-06-05	https://cohere.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Voyage AI	2026-06-05	https://docs.voyageai.com/docs/pricing	

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.
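
The conventions behind the inferred fields in this table reduce to two multipliers. The sketch below is illustrative only: the helper name `inferred_rates` is hypothetical, the field names mirror the `cachedInput`, `batchInput`, and `batchOutput` columns above, and the $3 / $15 base rates are example inputs rather than a statement of any vendor's current price.

```python
def inferred_rates(input_usd_per_m: float, output_usd_per_m: float,
                   cached_fraction: float = 0.10,
                   batch_fraction: float = 0.50) -> dict[str, float]:
    """Derive cached and batch rates from base per-1M-token prices."""
    return {
        "cachedInput": input_usd_per_m * cached_fraction,
        "batchInput": input_usd_per_m * batch_fraction,
        "batchOutput": output_usd_per_m * batch_fraction,
    }


# Default convention: cached input at 10% of base, batch at 50% of base.
rates = inferred_rates(3.00, 15.00)
print({k: round(v, 4) for k, v in rates.items()})
# {'cachedInput': 0.3, 'batchInput': 1.5, 'batchOutput': 7.5}

# Rows that use a 25% cached rate can override the default.
legacy = inferred_rates(3.00, 15.00, cached_fraction=0.25)
print(round(legacy["cachedInput"], 4))   # 0.75
```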

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →
