Context Window Cost Analyzer

At what doc size does your cost double?

Several frontier models - Gemini 3.1 Pro, GPT-5.4, GPT-5.5 - have long-context tiers that raise the input price past a token threshold. Find where yours flips, and the cheapest model that fits without the penalty.

Pricing verified: 2026-06-05 Big-doc / codebase / repo workloads
What this calculator does

When you pass long documents, many vendors charge 2× input / 1.5× output above a threshold (usually 128K or 200K tokens). This calc shows where you cross, what the premium costs, and which model is the sweet spot at your context size.

Why use it
  • Long-context premium tiers blindside teams that didn't read the fine print — 2× input is brutal at 500K-token RAG
  • Sweet-spot model varies by context size: GPT-5.5 wins at 50K, Gemini 3.1 Pro wins at 1M (the only major model with 2M ctx)
  • See exactly where each model crosses its premium threshold so you can engineer prompts to stay below it
  • Compare flat-pricing models (Claude Opus 4.7/4.8 within their 128K window) vs tiered (GPT-5.5 above 272K, Gemini 3.1 Pro above 200K) at your actual context size
Who uses this:
Vibe Coder Medium Useful when you hit "why did my cost double?" — answer is usually premium-tier trigger Small Business High RAG and long-document workloads frequently cross thresholds — this catches it before production Enterprise High Document-analysis and legal/research workloads live in long-context territory — model selection here has 5-10× impact

These are the inputs, outputs, and how you can use this calculator for your AI workloads.

📥 Inputs you provide
  • Input tokens per requestTotal context size including RAG + history
  • Output tokens per requestResponse size
  • Monthly requestsVolume scaling
📤 Outputs you get
  • Sweet-spot model recommendationCheapest fit at your context size
  • Cost vs context-size chartWhere each model jumps to premium tier
  • Per-model threshold listWhere each model's premium tier kicks in
  • Over-context flagsModels that can't fit your input
🎯 Use your results to
🎯
Catch threshold blindside

See premium-tier triggers BEFORE production usage doubles your bill

🍯
Find the sweet-spot model

Cheapest model that fits + stays in base tier at your real context size

✂️
Identify reduction targets

See how close you are to a threshold and whether to engineer down or switch model

📊
Defensible RAG model choice

Document-heavy workloads have 5-10× cost variance by model — this surfaces it

👇 Now try the calculator below with your own AI workloads

📊 Calculator at a glance
Context Window Cost full size
🎛 CALCULATOR
📋 Example Workload - change any field to see your actual numbers
Input tokens per requestTotal input token count per call: system prompt + retrieval + conversation history + user message + tool definitions.How to choose: Use your typical (not worst-case) size. Run a real rendered example through Token Estimator if you don't have a number. Examples: simple chat 500-2K, RAG with top-5 chunks 5-10K, long-document analysis 50-500K.Read the full guide →
≈ 4 characters per token in English · drag to see tier transitions
100K
1K 100K 200K ⚠ 272K ⚠ 500K 1M 2M
Typical workload:
Long-context cost is input-dominated - the default of 500 is fine for most workloads.
📈 RESULTS

Per-request cost at current input size

Models with tier transitions triggered show in orange. Models where your input doesn't fit are dimmed.

Monthly cost vs. input size

Spot the cliffs where pricing tiers flip. Curves are kinked, not smooth.

Long-context thresholds by model

ModelThresholdMultiplierStatus for your input
💡 Recommendations
    📋 What now?
    • Find the sweet spot — the cheapest model that fits your input AND stays in its base pricing tier
    • Catch the threshold trap — see exactly where each model flips to premium long-context pricing
    • Compare strategies — re-engineer the prompt down, or switch to a flat-priced long-context model, before you commit
    Need help cutting your AI bill? 💼 Talk to a CloudIntelligence advisor →
    Full cost calculator → Try RAG instead? → Token volatility →
    Now that you have your cost at context size…

    What this means + what to do next

    💡 What to consider beyond this cost at context size for full TCO
    • Quality at long context — many models lose recall above 100K (lost-in-the-middle); this calc doesn't score quality
    • Latency at long context — TTFT often climbs sharply above 200K tokens, even for fast models
    • Prompt-cache interaction — cached reads at 50-90% off can flip the cheapest-model answer if you have a stable prefix
    • Vendor rate limits — some long-context models have lower TPM/RPM limits than their short-context tier
    Rule of thumb: For document-heavy workloads, run a context-fidelity eval (needle-in-haystack, multi-hop QA) on your candidates at your actual context size. Sweet-spot by price isn't always sweet-spot by quality.
    Quantify the hidden costs:
    • Stable long-context prefixes cache well — caching often beats model-swap for RAG Prompt Cache Roi
    • If long context is from RAG, optimizing retrieval (smaller chunks, better reranking) cuts BOTH cost AND threshold risk Rag Pipeline
    • Often 20-40% of long-context prompts is reducible without quality loss Token Reduction Analyzer
    $ How this fits your overall ROI

    Long-context models trade cost for capability. ROI questions:

    • Does my workload genuinely need full-context processing, or can RAG retrieve just the relevant parts?
    • Would switching to a higher-threshold model (GPT-5.5 with 272K limit) save more than caching with a lower-threshold one (Gemini 3.1 Pro at 200K)?
    • How sensitive is downstream quality to context-window pruning vs aggressive retrieval?
    Bridge to ROI:
    • RAG with smaller chunks may match full-context quality at 10-100× lower cost Rag Pipeline
    • Route long-context queries to higher-threshold models (GPT-5.5), short ones to cheaper Flash-tier models Multi Model Router
    • Once you've picked a model + context strategy, get the exact $/month Cost Calculator
    Doing something different?

    If context size isn't your dominant cost variable:

    • You're not near any threshold — standard tier pricing is fine Cost Calculator
    • Long context comes from RAG retrieval — optimize there instead Rag Pipeline
    • You don't care about threshold engineering, just want the cheapest valid pick Cheapest Model

    Go deeper

    Our playbooks on cutting this number.

    🧬
    Embedding Cost
    RAG may be cheaper than stuffing context
    🧮
    Cost Calculator
    Full workload calculator
    📉
    Token Volatility
    Tier transitions explained
    🔍
    AI Model Finder
    Filter by context window

    The calculator's an estimate. Want the real number?

    A 5-day Quickscan ($1,500) reviews your actual usage across every pillar — financial, reliability, governance, privacy, MLOps, observability — and returns a concrete savings plan.

    Book a Quickscan →
    📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

    Methodology

    • All prices are USD per 1 million tokens, current as of 2026-06-05.
    • Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
    • Batch API discounts are 50% off standard rates across providers that offer Batch mode.
    • Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
    • Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
    • Long-context pricing tiers apply when input exceeds model threshold.
    • Embedding prices are input-only (no output tokens generated).

    Primary sources

    Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

    Anthropic
    2026-06-05
    https://www.anthropic.com/pricing
    Daily snapshot since Sep 2023 · 578 days captured
    Anthropic Docs
    2026-06-05
    https://platform.claude.com/docs/en/about-claude/pricing
    Daily snapshot since Sep 2023 · 578 days captured
    OpenAI
    2026-06-05
    https://openai.com/api/pricing/
    Daily snapshot since Sep 2023 · 579 days captured
    Google AI
    2026-06-05
    https://ai.google.dev/gemini-api/docs/pricing
    Daily snapshot since Dec 2023 · 554 days captured
    Google Vertex
    2026-06-05
    https://cloud.google.com/vertex-ai/generative-ai/pricing
    Daily snapshot since Dec 2023 · 554 days captured
    DeepSeek
    2026-06-05
    https://api-docs.deepseek.com/quick_start/pricing
    Daily snapshot since May 2024 · 493 days captured
    xAI
    2026-06-05
    https://x.ai/api
    Daily snapshot since Nov 2024 · 411 days captured
    Mistral
    2026-06-05
    https://mistral.ai/pricing
    Daily snapshot since Dec 2023 · 552 days captured
    Cohere
    2026-06-05
    https://cohere.com/pricing
    Daily snapshot since Sep 2023 · 578 days captured

    Inferred values (marked with * in calculator tables)

    Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

    Vendor / Model Field Why it’s inferred
    Anthropic — Claude Sonnet 4.6 cachedInput Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
    Anthropic — Claude Sonnet 4.5 cachedInput Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
    Anthropic — Claude Sonnet 4.5 batchInput Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
    Anthropic — Claude Sonnet 4.5 batchOutput Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
    Anthropic — Claude Haiku 4.5 cachedInput Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
    OpenAI — GPT-5.4 Mini cachedInput Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
    OpenAI — GPT-5.4 Nano cachedInput Derived at 10% of input — OpenAI 90% cache-hit convention.
    OpenAI — GPT-5.4 Nano batchInput Derived at 50% of input — OpenAI Batch API uniform 50% discount.
    OpenAI — GPT-5.4 Nano batchOutput Derived at 50% of output — OpenAI Batch API uniform 50% discount.
    OpenAI — GPT-5.4 Pro cachedInput Derived at 10% of input — OpenAI 90% cache-hit convention.
    OpenAI — GPT-5.4 Pro batchInput Derived at 50% of input — OpenAI Batch API uniform 50% discount.
    OpenAI — GPT-5.4 Pro batchOutput Derived at 50% of output — OpenAI Batch API uniform 50% discount.
    OpenAI — GPT-5.2 cachedInput Derived at 10% of input; no residency uplift.
    OpenAI — GPT-5.2 batchInput Derived at 50% of input.
    OpenAI — GPT-5.2 batchOutput Derived at 50% of output.
    OpenAI — GPT-5 cachedInput Derived at 10% of input.
    OpenAI — GPT-5 batchInput Derived at 50% of input.
    OpenAI — GPT-5 batchOutput Derived at 50% of output.
    OpenAI — GPT-5.5 Pro cachedInput Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
    OpenAI — GPT-5.5 Pro batchInput Derived at 50% of input.
    OpenAI — GPT-5.5 Pro batchOutput Derived at 50% of output.
    OpenAI — GPT-5.2 Pro cachedInput Derived at 10% of input — pro-tier convention.
    OpenAI — GPT-5.2 Pro batchInput Derived at 50% of input.
    OpenAI — GPT-5.2 Pro batchOutput Derived at 50% of output.
    OpenAI — GPT-5.1 batchInput Derived at 50% of input.
    OpenAI — GPT-5.1 batchOutput Derived at 50% of output.
    OpenAI — GPT-5 Pro batchInput Derived at 50% of input.
    OpenAI — GPT-5 Pro batchOutput Derived at 50% of output.
    OpenAI — GPT-5 Nano cachedInput Derived at 10% of input.
    OpenAI — GPT-5 Nano batchInput Derived at 50% of input.
    OpenAI — GPT-5 Nano batchOutput Derived at 50% of output.
    Google — Gemini 3 Flash cachedInput Derived at 10% of input — Google caching discount convention ~90%.
    Google — Gemini 3.1 Flash-Lite cachedInput Derived at 10% of input — Google caching convention.
    Google — Gemini 3.1 Flash-Lite batchInput Derived at 50% of input — Google Batch API uniform 50% discount.
    Google — Gemini 3.1 Flash-Lite batchOutput Derived at 50% of output — Google Batch API uniform 50% discount.
    Google — Gemini 2.5 Pro cachedInput Derived at 10% of input.
    Google — Gemini 2.5 Flash cachedInput Derived at 10% of input.
    Google — Gemini 2.5 Flash-Lite cachedInput Derived at 10% of input — Google caching convention.
    Google — Gemini 2.5 Flash-Lite batchInput Derived at 50% of input — Google Batch API uniform 50% discount.
    Google — Gemini 2.5 Flash-Lite batchOutput Derived at 50% of output — Google Batch API uniform 50% discount.
    Google — Gemini 2.0 Flash cachedInput Derived at 25% of input per Google 2.0 family caching rates.
    Google — Gemini 2.0 Flash batchInput Derived at 50% of input — Google Batch API uniform 50% discount.
    Google — Gemini 2.0 Flash batchOutput Derived at 50% of output — Google Batch API uniform 50% discount.
    Google — Gemini 2.0 Flash-Lite cachedInput Derived at 10% of input — Google caching convention.
    Google — Gemini 2.0 Flash-Lite batchInput Derived at 50% of input — Google Batch API uniform 50% discount.
    Google — Gemini 2.0 Flash-Lite batchOutput Derived at 50% of output — Google Batch API uniform 50% discount.
    xAI — Grok 4 (legacy) cachedInput Extrapolated at 25% of base.

    Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →