At what doc size does your cost double?
Several frontier models - Gemini 3.1 Pro, GPT-5.4, GPT-5.5 - have long-context tiers that raise the input price past a token threshold. Find where yours flips, and the cheapest model that fits without the penalty.
When you pass long documents, many vendors charge 2× input / 1.5× output above a threshold (usually 128K or 200K tokens). This calc shows where you cross, what the premium costs, and which model is the sweet spot at your context size.
- Long-context premium tiers blindside teams that didn't read the fine print — 2× input is brutal at 500K-token RAG
- The sweet-spot model varies by context size: GPT-5.5 wins at 50K, Gemini 3.1 Pro wins at 1M (it is the only major model with a 2M-token context window)
- See exactly where each model crosses its premium threshold so you can engineer prompts to stay below it
- Compare flat-pricing models (Claude Opus 4.7/4.8 within their 128K window) vs tiered (GPT-5.5 above 272K, Gemini 3.1 Pro above 200K) at your actual context size
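The tier rule described above can be sketched in a few lines. This is a minimal illustration, not any vendor's billing logic: the `Pricing` class, the `request_cost` function, and every price are hypothetical placeholders. It also assumes the premium reprices the whole request once input crosses the threshold; some vendors may bill differently, so check the actual price sheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price sheet. All numbers are placeholders, not vendor quotes."""
    input_per_mtok: float        # USD per 1M input tokens, base tier
    output_per_mtok: float       # USD per 1M output tokens, base tier
    threshold: Optional[int]     # premium applies when input exceeds this; None = flat pricing
    input_mult: float = 2.0      # 2x input premium, as quoted in the text
    output_mult: float = 1.5     # 1.5x output premium, as quoted in the text


def request_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request. Assumes the premium reprices the whole request."""
    premium = p.threshold is not None and input_tokens > p.threshold
    in_rate = p.input_per_mtok * (p.input_mult if premium else 1.0)
    out_rate = p.output_per_mtok * (p.output_mult if premium else 1.0)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Two made-up models: one tiered at 200K input tokens, one flat-priced.
tiered = Pricing(input_per_mtok=2.0, output_per_mtok=10.0, threshold=200_000)
flat = Pricing(input_per_mtok=3.0, output_per_mtok=15.0, threshold=None)

below = request_cost(tiered, 200_000, 1_000)     # still in the base tier
above = request_cost(tiered, 200_001, 1_000)     # one token over: premium rates apply
flat_cost = request_cost(flat, 200_001, 1_000)   # flat pricing never switches tiers
print(f"below={below:.4f} above={above:.4f} flat={flat_cost:.4f}")
```

With these placeholder numbers, a single extra input token nearly doubles the request cost on the tiered model, and the flat-priced model, which is more expensive below the threshold, becomes the cheaper option above it.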
Here are the inputs, the outputs, and how you can use this calculator for your AI workloads.
- Input tokens per request: total context size including RAG + history
- Output tokens per request: response size
- Monthly requests: volume scaling
- Sweet-spot model recommendation: cheapest fit at your context size
- Cost vs context-size chart: where each model jumps to premium tier
- Per-model threshold list: where each model's premium tier kicks in
- Over-context flags: models that can't fit your input
See premium-tier triggers BEFORE production usage doubles your bill
Cheapest model that fits + stays in base tier at your real context size
See how close you are to a threshold and whether to engineer down or switch model
Document-heavy workloads have 5-10× cost variance by model — this surfaces it
Per-request cost at current input size
Models with tier transitions triggered show in orange. Models where your input doesn't fit are dimmed.
Monthly cost vs. input size
Spot the cliffs where pricing tiers flip. Curves are kinked, not smooth.
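The kink in the curve can be reproduced by sampling monthly cost on either side of a threshold. The sketch below is illustrative only: `monthly_cost` is a hypothetical helper, the prices are placeholders, and the 200K threshold, the 2× / 1.5× multipliers, and the 10,000 requests per month are assumed values. It again assumes the whole request is repriced once input crosses the threshold.

```python
def monthly_cost(input_tokens: int, output_tokens: int, requests: int,
                 in_price: float = 2.0, out_price: float = 10.0,
                 threshold: int = 200_000) -> float:
    """Monthly USD cost for a hypothetical tiered model (placeholder prices per 1M tokens)."""
    premium = input_tokens > threshold
    in_rate = in_price * (2.0 if premium else 1.0)     # assumed 2x input above the threshold
    out_rate = out_price * (1.5 if premium else 1.0)   # assumed 1.5x output above the threshold
    per_request = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
    return per_request * requests


# Sample the curve at a few input sizes, including both sides of the threshold.
sizes = [50_000, 100_000, 200_000, 200_001, 300_000]
curve = [(n, monthly_cost(n, 1_000, 10_000)) for n in sizes]
for n, cost in curve:
    print(f"{n:>8,} tokens -> ${cost:,.2f}/month")

# The cliff is the jump between the last base-tier point and the first premium point.
cliff = monthly_cost(200_001, 1_000, 10_000) - monthly_cost(200_000, 1_000, 10_000)
print(f"cliff at threshold: ${cliff:,.2f}/month")
```

Under these assumptions the bill goes from about $4,100 to about $8,150 a month across a one-token step, which is the discontinuity the chart draws as a cliff.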
Long-context thresholds by model
- Find the sweet spot — the cheapest model that fits your input AND stays in its base pricing tier
- Catch the threshold trap — see exactly where each model flips to premium long-context pricing
- Compare strategies — re-engineer the prompt down, or switch to a flat-priced long-context model, before you commit
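One way to express the sweet-spot rule (cheapest model that fits and stays in its base tier) is a simple filter followed by a minimum. This is a sketch, not the calculator's actual code: the `Model` tuple, the `sweet_spot` function, and the three models with their windows, thresholds, and prices are all invented for illustration. For simplicity it checks only input tokens against the context window.

```python
from typing import NamedTuple, Optional, Sequence


class Model(NamedTuple):
    """Hypothetical model entry; names, windows, thresholds, and prices are placeholders."""
    name: str
    context_window: int        # max input tokens the model accepts
    threshold: Optional[int]   # premium tier starts above this; None = flat pricing
    input_per_mtok: float      # USD per 1M input tokens, base tier
    output_per_mtok: float     # USD per 1M output tokens, base tier


def sweet_spot(models: Sequence[Model], input_tokens: int, output_tokens: int) -> Optional[str]:
    """Name of the cheapest model that fits the input and stays in its base tier, else None."""
    candidates = []
    for m in models:
        if input_tokens > m.context_window:
            continue  # over-context: the input does not fit
        if m.threshold is not None and input_tokens > m.threshold:
            continue  # would trigger the premium tier
        cost = (input_tokens * m.input_per_mtok + output_tokens * m.output_per_mtok) / 1_000_000
        candidates.append((cost, m.name))
    return min(candidates)[1] if candidates else None


MODELS = [
    Model("model-a", 128_000, None, 3.0, 15.0),        # flat pricing, small window
    Model("model-b", 1_000_000, 200_000, 2.0, 10.0),   # large window, lower threshold
    Model("model-c", 400_000, 272_000, 2.5, 12.0),     # mid window, higher threshold
]

pick_small = sweet_spot(MODELS, 50_000, 1_000)    # every model qualifies; cheapest wins
pick_mid = sweet_spot(MODELS, 250_000, 1_000)     # only model-c fits and stays in base tier
pick_large = sweet_spot(MODELS, 500_000, 1_000)   # nothing qualifies
print(pick_small, pick_mid, pick_large)
```

A `None` result means no model handles the input at base-tier pricing, so the remaining choice is between paying the premium and trimming the prompt.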
What this means + what to do next
- Quality at long context — many models lose recall above 100K (lost-in-the-middle); this calc doesn't score quality
- Latency at long context — TTFT often climbs sharply above 200K tokens, even for fast models
- Prompt-cache interaction — cached reads at 50-90% off can flip the cheapest-model answer if you have a stable prefix
- Vendor rate limits — some long-context models have lower TPM/RPM limits than their short-context tier
- Stable long-context prefixes cache well; caching often beats a model swap for RAG (see Prompt Cache ROI)
- If long context comes from RAG, optimizing retrieval (smaller chunks, better reranking) cuts both cost and threshold risk (see RAG Pipeline)
- Often 20-40% of a long-context prompt can be cut without quality loss (see Token Reduction Analyzer)
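The prompt-cache interaction noted above can be illustrated with a small calculation. This is a simplified sketch under stated assumptions: `input_cost` is a hypothetical helper, both models and their prices are invented, and the 50% and 90% discounts are simply the two ends of the range quoted above. It ignores cache-write surcharges, cache expiry, and long-context tiers.

```python
def input_cost(input_tokens: int, price_per_mtok: float,
               cached_fraction: float = 0.0, cache_discount: float = 0.0) -> float:
    """USD input cost for one request when part of the prompt is read from cache.

    Simplified: ignores cache-write surcharges, cache expiry, and long-context tiers.
    """
    cached = input_tokens * cached_fraction   # tokens in the stable, cacheable prefix
    fresh = input_tokens - cached             # tokens billed at the full input price
    discounted_price = price_per_mtok * (1.0 - cache_discount)
    return (fresh * price_per_mtok + cached * discounted_price) / 1_000_000


TOKENS = 150_000
# Hypothetical models: x has the lower list price, y has the deeper cache discount.
x_plain = input_cost(TOKENS, 2.0)
y_plain = input_cost(TOKENS, 3.0)
x_cached = input_cost(TOKENS, 2.0, cached_fraction=0.8, cache_discount=0.5)
y_cached = input_cost(TOKENS, 3.0, cached_fraction=0.8, cache_discount=0.9)

print(f"no cache:  x={x_plain:.3f}  y={y_plain:.3f}")
print(f"80% cached: x={x_cached:.3f}  y={y_cached:.3f}")
```

Without caching, the model with the lower list price is cheaper. With an 80% stable prefix, the model with the deeper discount comes out ahead, which is how caching can flip the cheapest-model answer.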
Long-context models trade cost for capability. ROI questions:
- Does my workload genuinely need full-context processing, or can RAG retrieve just the relevant parts?
- Would switching to a higher-threshold model (GPT-5.5, with its 272K threshold) save more than caching with a lower-threshold one (Gemini 3.1 Pro, at 200K)?
- How sensitive is downstream quality to context-window pruning vs aggressive retrieval?
- RAG with smaller chunks may match full-context quality at 10-100× lower cost (see RAG Pipeline)
- Route long-context queries to higher-threshold models (GPT-5.5), short ones to cheaper Flash-tier models (see Multi Model Router)
- Once you've picked a model + context strategy, get the exact $/month (see Cost Calculator)
If context size isn't your dominant cost variable:
- You're not near any threshold, so standard tier pricing is fine (see Cost Calculator)
- Your long context comes from RAG retrieval, so optimize there instead (see RAG Pipeline)
- You don't care about threshold engineering and just want the cheapest valid pick (see Cheapest Model)