Guides → Playground & Guide → Multi-Model Router - Route Queries to the Cheapest Capable Model
Meet Priya Patel. Eng Lead at a 70-person AI startup. "Half our queries are simple - why pay Sonnet rates for them? How do I build smart routing?"
🔥 $25K/mo Anthropic bill. Estimated 40% goes to queries Haiku could handle.
Multi-model routing is the single highest-leverage cost optimization at scale. Most production AI workloads have 50-70% queries that don't need the flagship model. Classification, simple extraction, format conversion, follow-up clarifications, status checks - all handled fine by Haiku-class models at 5-10× lower cost.
Priya's $25K/mo Sonnet-only bill: estimated 40% of queries would route to Haiku (saving ~$8K/mo on those). Another 10% to ultra-cheap (Gemini Flash for trivial classification). 50% remains on Sonnet for complex reasoning. Total bill drops from $25K to ~$15K - 40% reduction with no quality regression.
Three routing strategies. (1) Rule-based - query length / type / source determines tier. Simple, brittle, ships in a day. (2) Cheap classifier router - small model decides which tier should handle. Smarter, costs $50-200/mo to run. (3) Cascade - try cheap tier, escalate to premium if confidence is low. Expensive on edge cases, optimal on average. Most teams converge on hybrid (rules + classifier).
Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.
What you'll see after the calculator runs. Each card explains how to read the number.
Route easy queries to cheap models, hard queries to premium. Real architecture for cutting AI bills 40-60% without quality drop. Implementation patterns + savings math.
router
Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.
Each input shapes your cost. Move the slider — see the impact.
Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.
🚀 Open the full calculator →Savings = current_spend × easy_share × (1 - 1/ratio). Priya: $25K × 0.4 × (1 - 1/8) = $8.75K/mo savings = $105K/yr.
Subtract router overhead. Cheap classifier router costs ~1% of original bill (~$250/mo at this scale). Engineering: 40 hours upfront × $150 = $6K. Payback: <1 month.
Watch for misroutes. Aggressive routing → wrong queries to cheap tier → bad answers → user complaints → revert. Build escalation path (low confidence → premium fallback) from day 1.
Stack with caching for compounding savings. Router + caching = 60-70% bill reduction. Each lever cuts independently.
Same calculator, three different team sizes. Click a tab to see how the numbers shift.
$2K bill. 50% easy queries × 5/6 = $830/mo savings. Setup $6K. Payback 7 months. Marginal - invest engineering on features instead unless quality wins compound.
Healthy range: Savings ~$830/mo - payback 7 months
$25K → $16.5K with routing. $8.5K/mo savings. Setup $6K. Payback 3 weeks. Mandatory.
Healthy range: Savings $8.5K/mo - payback <1 month
$80K consumer-scale. Cascade router (try cheap first, escalate). 60% easy share × 9/10 = $43K savings. $12K setup. Payback 2 weeks. Dramatic.
Healthy range: Savings $43K/mo (54%)
Cost isn't the only dimension. Click any constraint — see how recommendations change.
Start with rule-based for fast win. Upgrade to classifier when you have eval data. Move to cascade when scale justifies.
The router itself can mis-classify. Always have escalation: cheap tier returns low-confidence → retry with premium.
Routing across vendors complicates compliance. Each tier must independently meet the workload's compliance requirements.
Don't route sensitive data to a tier without no-train. Compliance + privacy are per-tier checks.
Router latency is small. Cascade adds latency only on the 10-15% of queries that escalate. Net latency usually improves (cheap models faster).
Routing across vendors (Anthropic + OpenAI + Google) is a feature, not bug - vendor abstraction at the application layer. LiteLLM helps.
Router behavior must be monitored - mis-routes silently degrade UX. Add per-route quality + cost dashboards.
Tradeoff analysis is where most AI projects go sideways. Talk to a CFO-grade AI cost analyst →
Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.
Agent decides between simple actions (cheap model) and complex reasoning (premium). 70% easy split typical. ~$9K/mo savings on $15K bill.
Healthy range: Savings $9K/mo - biggest fit
Cheap model classifies query intent first. 30% can be answered directly without RAG retrieval. ~$3K/mo savings, plus latency win.
Healthy range: Savings $3K/mo
Rule-based: short queries (<100 tokens) → Haiku. Long with multiple constraints → Sonnet. Ships in 2 days. ~$3.4K/mo.
Healthy range: Savings $3.4K/mo, 2 days work
Autocomplete: 80% trivial completions (cheap), 20% multi-line generation (premium). Cascade. ~$36K/mo savings.
Healthy range: Savings $36K/mo (72%)
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Cheapest Model per task type. Cost Calculator for full bill.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
View 3-year history for →
Last-verified date is the most recent successful daily snapshot
(aicost_pricing_snapshots) or, when no snapshot exists yet,
the latest successful crawler run (aicost_crawler_runs).
10 of 10
vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.)
are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput |
Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput |
Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput |
Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput |
Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput |
Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput |
Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput |
Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput |
Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput |
Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput |
Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput |
Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput |
Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput |
Extrapolated at 25% of base. |
Pricing is cross-verified against the
LiteLLM community registry
when available. Daily snapshots are kept in aicost_pricing_snapshots;
every change is logged to aicost_price_changelog with old & new
values for full audit trail. Read the full methodology →