Context Window Cost Analyzer

At what doc size does your cost double?

Several frontier models - Gemini 3.1 Pro, GPT-5.4, GPT-5.5 - have long-context tiers that raise the input price past a token threshold. Find where yours flips, and the cheapest model that fits without the penalty.

Pricing verified: 2026-06-05 · Big-doc / codebase / repo workloads

What this calculator does

When you pass long documents, several vendors charge 2× input / 1.5× output above a token threshold (200K to 272K tokens for the models tracked here). This calculator shows where you cross, what the premium costs, and which model is the sweet spot at your context size.

Why use it

Long-context premium tiers blindside teams that didn't read the fine print — 2× input is brutal at 500K-token RAG
Sweet-spot model varies by context size: GPT-5.5 wins at 50K, Gemini 3.1 Pro wins at 1M (the only major model with a 2M context window)
See exactly where each model crosses its premium threshold so you can engineer prompts to stay below it
Compare flat-pricing models (Claude Opus 4.7/4.8 within their 128K window) vs tiered (GPT-5.5 above 272K, Gemini 3.1 Pro above 200K) at your actual context size


Who uses this:

Vibe Coder: Medium · Small Business: High · Enterprise: High

Below are the inputs, the outputs, and how to use this calculator for your AI workloads.

📥 Inputs you provide

Input tokens per request: Total context size including RAG + history
Output tokens per request: Response size
Monthly requests: Volume scaling

📤 Outputs you get

Sweet-spot model recommendation: Cheapest fit at your context size
Cost vs context-size chart: Where each model jumps to premium tier
Per-model threshold list: Where each model's premium tier kicks in
Over-context flags: Models that can't fit your input

🎯 Use your results to

🎯 Catch threshold blindside: See premium-tier triggers BEFORE production usage doubles your bill

🍯 Find the sweet-spot model: Cheapest model that fits + stays in base tier at your real context size

✂️ Identify reduction targets: See how close you are to a threshold and whether to engineer down or switch model

📊 Defensible RAG model choice: Document-heavy workloads have 5-10× cost variance by model; this surfaces it

👇 Now try the calculator below with your own AI workloads

Get your context size right

This is the SUM of every input token per request: system prompt + retrieved docs + conversation history + user query + tool definitions. Don't use your prompt template — use a realistic rendered example. RAG with top-5 chunks at 500 tokens each = 2.5K from retrieval alone. Conversation history from 10 turns can add 5-20K. Long-document analysis can hit 100K+. Set the slider to your typical (not worst-case) size.
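The sum described above can be sketched in a few lines. This is a minimal illustration, not the calculator's own code: the function names and the component token counts (other than the 5 × 500-token retrieval example) are hypothetical placeholders, and the character-based estimate uses the rough 4-characters-per-token rule for English.

```python
def estimate_tokens_from_chars(char_count: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (~4 chars/token in English)."""
    return round(char_count / chars_per_token)


def total_input_tokens(
    system_prompt: int,
    retrieved_chunks: int,
    tokens_per_chunk: int,
    history: int,
    user_query: int,
    tool_definitions: int,
) -> int:
    """Sum every input component of one rendered request."""
    retrieval = retrieved_chunks * tokens_per_chunk
    return system_prompt + retrieval + history + user_query + tool_definitions


# Hypothetical rendered request: only the retrieval figures come from the text.
typical_request = total_input_tokens(
    system_prompt=800,       # assumed
    retrieved_chunks=5,      # top-5 chunks
    tokens_per_chunk=500,    # 500 tokens each = 2.5K from retrieval
    history=12_000,          # assumed, within the 5-20K range for 10 turns
    user_query=200,          # assumed
    tool_definitions=1_500,  # assumed
)
print(typical_request)
```

Measure the components with your vendor's tokenizer where you can; the character rule is only a fallback for sizing documents you have not tokenized yet.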

Watch the threshold triggers

Per-model triggers (prices are input/output per 1M tokens):

GPT-5.5: above 272K input, jumps to $10/$45 (input doubles, output 1.5×).
GPT-5.5 Pro: same 272K threshold, $60/$270.
GPT-5.4: 270K threshold, $5/$22.50.
Gemini 3.1 Pro: above 200K, $2/$12 → $4/$18 (input doubles, output 1.5×).
Claude Opus 4.7/4.8 and Sonnet 4.6: flat pricing, no published premium tier, but their max context is smaller (128K Opus / 64K Sonnet), so prompts that exceed it just won't fit.
Grok 4.3: doubles above 200K (1M max context).
DeepSeek V3.2: flat $0.28/$0.42 across its 128K window.

The threshold list shows every model's cross-point. If you're close to a threshold, engineer your prompt down (more aggressive retrieval, summarization, context-window pruning).
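A tiered price can be modeled as a simple step function. The sketch below uses the Gemini 3.1 Pro and DeepSeek V3.2 figures quoted above; the `Pricing` class and `request_cost` function are illustrative names, and two things are assumptions rather than vendor-confirmed behavior: that the whole request is billed at the premium rate once input exceeds the threshold, and that "above" means strictly greater than the threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    base_in: float                      # $ per 1M input tokens, base tier
    base_out: float                     # $ per 1M output tokens, base tier
    threshold: Optional[int] = None     # input tokens; None means flat pricing
    premium_in: Optional[float] = None
    premium_out: Optional[float] = None


def request_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in dollars.

    Assumption: crossing the threshold reprices the entire request,
    not only the tokens above the threshold.
    """
    premium = p.threshold is not None and input_tokens > p.threshold
    rate_in = p.premium_in if premium else p.base_in
    rate_out = p.premium_out if premium else p.base_out
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


GEMINI_31_PRO = Pricing(2.00, 12.00, threshold=200_000, premium_in=4.00, premium_out=18.00)
DEEPSEEK_V32 = Pricing(0.28, 0.42)  # flat across its window

at_threshold = request_cost(GEMINI_31_PRO, 200_000, 500)
just_over = request_cost(GEMINI_31_PRO, 200_001, 500)
print(f"{at_threshold:.3f} -> {just_over:.3f}")
```

Under these assumptions a single extra input token roughly doubles the request cost, which is why trimming a prompt that sits just above a threshold pays off so quickly.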

Identify the sweet-spot model

The "Sweet spot" badge marks the cheapest model that fits your context AND stays in its base tier. At 50K tokens, this is usually GPT-5-mini or Gemini 3 Flash. At 200K, often Gemini 3 Flash or Gemini 3.1 Pro (just under its threshold). At 1M, Gemini 3.1 Pro is the main option (most others either don't support it or hit premium). The "Over context" badge flags models whose max context is smaller than your input: Claude Sonnet 4.6 at 64K and most Claude Opus variants at 128K will hit this for long-document workloads.
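The selection rule behind the badge (fits the context, stays in the base tier, then cheapest) can be sketched as a filter followed by a minimum. This is an illustration under stated assumptions, not the calculator's implementation: the catalog is a three-model subset, the GPT-5.5 base rates of $5/$30 are inferred from the 2× / 1.5× step to $10/$45 quoted on this page, and the GPT-5.5 `max_context` value is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    name: str
    max_context: int                 # largest input the model accepts
    rate_in: float                   # base-tier $ per 1M input tokens
    rate_out: float                  # base-tier $ per 1M output tokens
    threshold: Optional[int] = None  # premium tier starts above this; None = flat


CATALOG = [
    Model("Gemini 3.1 Pro", 2_000_000, 2.00, 12.00, threshold=200_000),
    Model("GPT-5.5", 1_000_000, 5.00, 30.00, threshold=272_000),  # max_context is a placeholder
    Model("DeepSeek V3.2", 128_000, 0.28, 0.42),
]


def sweet_spot(catalog: List[Model], input_tokens: int, output_tokens: int = 500) -> Optional[str]:
    """Cheapest model that fits the input and stays in its base tier, else None."""
    candidates = [
        m for m in catalog
        if input_tokens <= m.max_context
        and (m.threshold is None or input_tokens <= m.threshold)
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda m: input_tokens * m.rate_in + output_tokens * m.rate_out)
    return best.name


for size in (50_000, 150_000, 250_000, 1_000_000):
    print(size, sweet_spot(CATALOG, size))
```

Returning `None` when every fitting model is already in its premium tier is a design choice of this sketch; a fuller version would fall back to the cheapest model that fits at premium rates.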

Re-engineer or accept the premium

Three options if you're hitting premium: (1) Cut input: see Token Reduction for compression strategies. (2) Switch to a model with a higher threshold: GPT-5.5 doesn't move to premium pricing until 272K (vs Gemini 3.1 Pro at 200K), so for 200-270K workloads GPT-5.5 stays in its base tier while Gemini 3.1 Pro has already doubled. Compare absolute rates before switching, though: by the figures above, Gemini 3.1 Pro's premium input rate ($4 per 1M) is still below GPT-5.5's base input rate ($5 per 1M, implied by the doubling to $10). (3) Accept and price-in: sometimes the quality gain at large context justifies the premium. The chart visualizes this trade-off across all models simultaneously.
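The three options can be put side by side for a workload sitting at 250K input tokens. The sketch uses the rates quoted on this page; the GPT-5.5 base rates ($5/$30) are inferred from the stated 2× / 1.5× step to $10/$45, and the request volume and the 190K "cut input" target are hypothetical.

```python
def per_request(input_tokens: int, output_tokens: int, rate_in: float, rate_out: float) -> float:
    """Dollar cost of one request at the given $ per 1M token rates."""
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


REQUESTS_PER_MONTH = 10_000  # hypothetical volume
OUTPUT_TOKENS = 500          # the calculator's default output size

per_request_cost = {
    # (3) stay on Gemini 3.1 Pro at 250K and pay the premium tier
    "accept premium": per_request(250_000, OUTPUT_TOKENS, 4.00, 18.00),
    # (2) switch to GPT-5.5, still in its base tier at 250K (inferred base rates)
    "switch model": per_request(250_000, OUTPUT_TOKENS, 5.00, 30.00),
    # (1) trim the prompt to 190K so Gemini 3.1 Pro drops back to base tier
    "cut input": per_request(190_000, OUTPUT_TOKENS, 2.00, 12.00),
}

monthly = {name: cost * REQUESTS_PER_MONTH for name, cost in per_request_cost.items()}
cheapest = min(monthly, key=monthly.get)

for name, dollars in monthly.items():
    print(f"{name}: ${dollars:,.2f}/month")
print("cheapest:", cheapest)
```

With these particular numbers, trimming the prompt wins by a wide margin and switching models is the most expensive path, so tier status alone is not a reliable guide.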


🎛 CALCULATOR

Input tokens per request: slider from 1K to 2M (shown at 100K), with ⚠ tier markers at 200K and 272K. ≈ 4 characters per token in English · drag to see tier transitions

Requests / month

Output tokens / req: Long-context cost is input-dominated, so the default of 500 is fine for most workloads.

📈 RESULTS

Per-request cost at current input size

Models with tier transitions triggered show in orange. Models where your input doesn't fit are dimmed.

Monthly cost vs. input size

Spot the cliffs where pricing tiers flip. Curves are kinked, not smooth.
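The kink in a cost curve can also be located numerically by sweeping input size in small steps and flagging any step where cost jumps far more than the size did. This sketch hard-codes the Gemini 3.1 Pro rates quoted on this page and assumes the whole request is repriced once input exceeds 200K; the function names, the 10,000 requests/month volume and the 1.5× jump cutoff are hypothetical.

```python
from typing import List, Tuple


def monthly_cost(input_tokens: int, requests: int, output_tokens: int = 500) -> float:
    """Monthly dollars for Gemini 3.1 Pro at the rates quoted on this page."""
    premium = input_tokens > 200_000  # assumption: strictly above the threshold
    rate_in, rate_out = (4.00, 18.00) if premium else (2.00, 12.00)
    return requests * (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


def find_cliffs(sizes: List[int], requests: int, min_jump: float = 1.5) -> List[Tuple[int, int]]:
    """Return adjacent size pairs where cost grows by more than `min_jump`x.

    Only meaningful when the sweep step is small relative to the sizes.
    """
    cliffs = []
    for lo, hi in zip(sizes, sizes[1:]):
        if monthly_cost(hi, requests) > min_jump * monthly_cost(lo, requests):
            cliffs.append((lo, hi))
    return cliffs


sweep = list(range(190_000, 211_000, 1_000))  # 190K to 210K in 1K steps
cliffs = find_cliffs(sweep, requests=10_000)
print(cliffs)
```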

Long-context thresholds by model

Model | Threshold | Multiplier | Status for your input

💡 Recommendations

📋 What now?

Find the sweet spot — the cheapest model that fits your input AND stays in its base pricing tier
Catch the threshold trap — see exactly where each model flips to premium long-context pricing
Compare strategies — re-engineer the prompt down, or switch to a flat-priced long-context model, before you commit


Now that you have your cost at context size…

What this means + what to do next

💡 What to consider beyond cost at context size for full TCO

Quality at long context — many models lose recall above 100K (lost-in-the-middle); this calc doesn't score quality
Latency at long context — TTFT often climbs sharply above 200K tokens, even for fast models
Prompt-cache interaction — cached reads at 50-90% off can flip the cheapest-model answer if you have a stable prefix
Vendor rate limits — some long-context models have lower TPM/RPM limits than their short-context tier
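The prompt-cache point is easiest to see with numbers. In this sketch both models and their rates are hypothetical: model A has the higher list price but a 90% discount on cached reads, while model B is cheaper per token and is assumed to offer no caching.

```python
def input_cost(total_tokens: int, cached_tokens: int, rate_per_m: float, cache_discount: float) -> float:
    """Input cost in dollars when `cached_tokens` of the prompt are cache hits.

    `cache_discount` is the fraction taken off the input rate for cached
    reads (0.9 means cached tokens cost 10% of the normal rate).
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    fresh_tokens = total_tokens - cached_tokens
    cached_rate = rate_per_m * (1 - cache_discount)
    return (fresh_tokens * rate_per_m + cached_tokens * cached_rate) / 1_000_000


TOTAL = 100_000          # input tokens per request
STABLE_PREFIX = 80_000   # hypothetical share that is identical across requests

# Without caching, the cheaper list price wins.
a_uncached = input_cost(TOTAL, 0, rate_per_m=2.00, cache_discount=0.9)
b_uncached = input_cost(TOTAL, 0, rate_per_m=1.50, cache_discount=0.0)

# With a stable prefix served from cache, the ranking flips.
a_cached = input_cost(TOTAL, STABLE_PREFIX, rate_per_m=2.00, cache_discount=0.9)

print(a_uncached, b_uncached, a_cached)
```

The sketch ignores cache-write surcharges and cache expiry, both of which vary by vendor and would need to be added for a real comparison.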

Rule of thumb: For document-heavy workloads, run a context-fidelity eval (needle-in-haystack, multi-hop QA) on your candidates at your actual context size. Sweet-spot by price isn't always sweet-spot by quality.

Quantify the hidden costs:

Stable long-context prefixes cache well; caching often beats model-swap for RAG (see Prompt Cache ROI)
If long context is from RAG, optimizing retrieval (smaller chunks, better reranking) cuts BOTH cost AND threshold risk (see RAG Pipeline)
Often 20-40% of a long-context prompt is reducible without quality loss (see Token Reduction Analyzer)

$ How this fits your overall ROI

Long-context models trade cost for capability. ROI questions:

Does my workload genuinely need full-context processing, or can RAG retrieve just the relevant parts?
Would switching to a higher-threshold model (GPT-5.5 with its 272K threshold) save more than caching with a lower-threshold one (Gemini 3.1 Pro at 200K)?
How sensitive is downstream quality to context-window pruning vs aggressive retrieval?

Bridge to ROI:

RAG with smaller chunks may match full-context quality at 10-100× lower cost (see RAG Pipeline)
Route long-context queries to higher-threshold models (GPT-5.5), short ones to cheaper Flash-tier models (see Multi Model Router)
Once you've picked a model + context strategy, get the exact $/month (see Cost Calculator)

Doing something different?

If context size isn't your dominant cost variable:

You're not near any threshold, so standard tier pricing is fine (see Cost Calculator)
Long context comes from RAG retrieval, so optimize there instead (see RAG Pipeline)
You don't care about threshold engineering, just want the cheapest valid pick (see Cheapest Model)

Vendor / Model | Field | Why it’s inferred
Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 | batchInput | Derived at 50% of input.
OpenAI / GPT-5 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.1 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output.
Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%.
Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base.
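The derivation rules in these entries reduce to two multipliers. A minimal sketch, assuming hypothetical list prices of $3.00 input and $15.00 output per 1M tokens (placeholders, not taken from any vendor's price sheet); the function name is illustrative, while the field names mirror the entries above.

```python
def derive_inferred_rates(
    input_rate: float,
    output_rate: float,
    cache_fraction: float = 0.10,   # cachedInput as a share of the input rate
    batch_fraction: float = 0.50,   # batch rates as a share of standard rates
) -> dict:
    """Fill in unpublished rates from the conventions listed above ($ per 1M tokens)."""
    return {
        "cachedInput": round(input_rate * cache_fraction, 4),
        "batchInput": round(input_rate * batch_fraction, 4),
        "batchOutput": round(output_rate * batch_fraction, 4),
    }


# Placeholder list prices; substitute the published rates for a real model.
default_convention = derive_inferred_rates(3.00, 15.00)

# Entries that use a 25% cached rate instead of 10% override the fraction.
quarter_rate_cache = derive_inferred_rates(3.00, 15.00, cache_fraction=0.25)

print(default_convention)
print(quarter_rate_cache)
```

Keeping the fractions as parameters makes it easy to replace an inferred value with a published one when a vendor documents it.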



Methodology

Primary sources

Inferred values (marked with * in calculator tables)