Guides → Playground & Guide → Multi-Model Router - Route Queries to the Cheapest Capable Model

Multi-Model Router - Route Queries to the Cheapest Capable Model

Meet Priya Patel. Eng Lead at a 70-person AI startup. "Half our queries are simple - why pay Sonnet rates for them? How do I build smart routing?"

🔥 $25K/mo Anthropic bill. Estimated 40% goes to queries Haiku could handle.

The story

Multi-model routing is the single highest-leverage cost optimization at scale. Most production AI workloads have 50-70% queries that don't need the flagship model. Classification, simple extraction, format conversion, follow-up clarifications, status checks - all handled fine by Haiku-class models at 5-10× lower cost.

Priya's $25K/mo Sonnet-only bill: estimated 40% of queries would route to Haiku (saving ~$8K/mo on those). Another 10% to ultra-cheap (Gemini Flash for trivial classification). 50% remains on Sonnet for complex reasoning. Total bill drops from $25K to ~$15K - 40% reduction with no quality regression.

Three routing strategies. (1) Rule-based - query length / type / source determines tier. Simple, brittle, ships in a day. (2) Cheap classifier router - small model decides which tier should handle. Smarter, costs $50-200/mo to run. (3) Cascade - try cheap tier, escalate to premium if confidence is low. Expensive on edge cases, optimal on average. Most teams converge on hybrid (rules + classifier).

📊 CALCULATOR AT A GLANCE

🚀 Open the full calculator ✉️ Email [email protected]

🎛 Inputs you control

Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.

▸ Monthly requests — Total API requests per month across all tiers. Scales the absolute dollar savings — the routing math is identical at any volume, but volume decides whether it clears the engineering cost of building the router.

How to choose: Pull from your provider dashboard, or requests/day × 30. Below roughly $1K/mo of spend the build usually is not worth it; the bigger and more mixed your traffic, the bigger the payoff.

▸ Input tokens / request — Average input tokens per request. Applied equally across all three tiers, so it scales every tier's cost together.

How to choose: Use your real average prompt size. If tiers differ a lot in practice, model the dominant one — this is a planning estimate, not a per-query trace.

▸ Output tokens / request — Average output tokens per request. Output is usually the pricier half of a call, so it weighs heavily on the premium tier.

How to choose: Set to your real average completion length. Long generations make the premium tier more expensive, which increases the payoff from routing easy queries away from it.

▸ Tier split — How your traffic divides across cheap, mid, and premium models. This is the core routing decision — most production workloads have 50-70% of queries a cheap model handles fine.

How to choose: Cheap (🟢): simple/structured work — classification, extraction, FAQ, autocomplete, status checks. Mid (🟡): standard reasoning — explanations, synthesis, summaries. Premium (🟣): hard reasoning — architecture, multi-step logic, edge cases. Start from a preset, then drag the sliders; they co-adjust toward 100%.

▸ Baseline model — The single model you would run for everything if you were not routing — the thing routing is measured against. Defaults to your premium-tier model.

How to choose: Set it to what you actually run today (or would). The savings card and comparison table are all relative to this; an unrealistic baseline makes the savings look better or worse than reality.

▸ Routing method — How queries get assigned to a tier, and what that decision costs. Routing is not free — something has to decide where each query goes.

How to choose: No classifier = rules (length, type, source): zero overhead, brittle. LLM-as-judge = a small model decides per query: ~1% of bill, smartest, ~90-95% accurate. Embedding = match to intent clusters: cheaper, ~85-92% accurate. Always build a low-confidence → premium escalation path so misroutes do not tank quality.

📊 Outputs computed for you

What you'll see after the calculator runs. Each card explains how to read the number.

▸ Baseline cost — Monthly cost if every request ran on the baseline model.

How to read: The number routing has to beat — compare it against the routed card.

▸ Routed cost — Monthly cost of your tier mix, including classifier overhead.

How to read: Below baseline = routing pays. The cost-share bar shows where the dollars actually land per tier.

▸ Monthly savings — Baseline minus routed — the headline monthly and annual dollars saved.

How to read: Aggressive savings usually mean a lot of traffic on the cheap tier, so confirm quality holds there on an eval set before banking it.

▸ Classifier overhead — What the routing decision itself costs per month.

How to read: Should be a small slice (often ~1%). If it is a large share of routed cost, simplify toward rules-based routing.

About this calculator: Multi-Model Router - Route Queries to the Cheapest Capable Model

Route easy queries to cheap models, hard queries to premium. Real architecture for cutting AI bills 40-60% without quality drop. Implementation patterns + savings math.

Inputs you control

Input	Impact on result	Range	Typical
Current monthly spend (single-tier)	Bill if you stay on one premium model for everything.	500 – 500K	25000
Easy-query share (%) - cheap-model-eligible	Fraction of queries handleable by cheap tier. Audit by sampling 100 queries; categorize by complexity.	0 – 90	40
Premium / cheap price ratio	How much cheaper the cheap tier is per task. Sonnet vs Haiku: ~7-10×. Sonnet vs Gemini Flash: ~15-20×.	3 – 30	8

Outputs computed for you · model: `router`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.

What you're looking at

Each input shapes your cost. Move the slider — see the impact.

Current monthly spend (single-tier) 25,000

Bill if you stay on one premium model for everything.

Estimated: —

Easy-query share (%) - cheap-model-eligible 40

Fraction of queries handleable by cheap tier. Audit by sampling 100 queries; categorize by complexity.

Estimated: —

Premium / cheap price ratio 8

How much cheaper the cheap tier is per task. Sonnet vs Haiku: ~7-10×. Sonnet vs Gemini Flash: ~15-20×.

Estimated: —

Ready to run the numbers?

Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.

🚀 Open the full calculator →

Reading your result

Savings = current_spend × easy_share × (1 - 1/ratio). Priya: $25K × 0.4 × (1 - 1/8) = $8.75K/mo savings = $105K/yr.

Subtract router overhead. Cheap classifier router costs ~1% of original bill (~$250/mo at this scale). Engineering: 40 hours upfront × $150 = $6K. Payback: <1 month.

Watch for misroutes. Aggressive routing → wrong queries to cheap tier → bad answers → user complaints → revert. Build escalation path (low confidence → premium fallback) from day 1.

Stack with caching for compounding savings. Router + caching = 60-70% bill reduction. Each lever cuts independently.

What "good" looks like:

Strong router fit: >$5K/mo bill, mixed query complexity, premium-only today - 30-50% savings achievable
Moderate fit: $1-5K/mo, some query diversity - 15-30%
Skip routing: <$1K/mo (eng cost > savings) OR uniform-complexity workload
Cascade pattern: Best for variable difficulty, more eng to build

Models for routing tiers (cheap/mid/premium)

Verified 20 hours ago

1

GPT-5 Mini

$0.250 in · $2.00 out ·
2

Command

$1.00 in · $2.00 out ·
3

devstral-2

$0.400 in · $2.00 out ·

Three real scenarios

Same calculator, three different team sizes. Click a tab to see how the numbers shift.

$1,167 / month ≈ $14,000 / year

$2K bill. 50% easy queries × 5/6 = $830/mo savings. Setup $6K. Payback 7 months. Marginal - invest engineering on features instead unless quality wins compound.

Healthy range: Savings ~$830/mo - payback 7 months

See inputs used

currentMonthlyUsd: 2,000
easyQueryShare: 50
premiumToCheapPriceRatio: 6
routerOverheadPctOfBill: 2
setupEngHours: 40

Trade-offs

Cost isn't the only dimension. Click any constraint — see how recommendations change.

What matters most to you? Click any dimension — recommendations update.

Best fit for "cost":

Rule-based router $0 ongoing, ships fast
Classifier-based router $50-500/mo overhead, smarter
Cascade router Highest savings, most eng

Start with rule-based for fast win. Upgrade to classifier when you have eval data. Move to cascade when scale justifies.

Use cases

Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.

$5,813 / month ≈ $69,750 / year

Agent decides between simple actions (cheap model) and complex reasoning (premium). 70% easy split typical. ~$9K/mo savings on $15K bill.

Healthy range: Savings $9K/mo - biggest fit

See inputs used

currentMonthlyUsd: 15,000
easyQueryShare: 70
premiumToCheapPriceRatio: 8
routerOverheadPctOfBill: 1
setupEngHours: 24

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Doesn't model classifier accuracy variance - actual depends on training set.
Cascade overhead on edge cases isn't modeled (escalations cost extra).
Quality regression risk is workload-specific.
Doesn't include vendor abstraction layer (LiteLLM, custom) cost.

For these, use: Cheapest Model per task type. Cost Calculator for full bill.

Where to go next

Pick cheap-tier model per task →

Foundation for the routing strategy.

Stack caching with routing →

Two highest-ROI optimizations combined.

Stack 5 levers →

Routing + caching + reduction.

Methodology

Source: /ai-cost-economics
Extraction: Routing case studies from production migrations (anonymized).
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

View 3-year history for →

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Anthropic

2026-06-05

https://www.anthropic.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Anthropic Docs

2026-06-05

https://platform.claude.com/docs/en/about-claude/pricing

Daily snapshot since Sep 2023 · 578 days captured

OpenAI

2026-06-05

https://openai.com/api/pricing/

Daily snapshot since Sep 2023 · 579 days captured

Google AI

2026-06-05

https://ai.google.dev/gemini-api/docs/pricing

Daily snapshot since Dec 2023 · 554 days captured

Google Vertex

2026-06-05

https://cloud.google.com/vertex-ai/generative-ai/pricing

Daily snapshot since Dec 2023 · 554 days captured

DeepSeek

2026-06-05

https://api-docs.deepseek.com/quick_start/pricing

Daily snapshot since May 2024 · 493 days captured

xAI

2026-06-05

https://x.ai/api

Daily snapshot since Nov 2024 · 411 days captured

Mistral

2026-06-05

https://mistral.ai/pricing

Daily snapshot since Dec 2023 · 552 days captured

Cohere

2026-06-05

https://cohere.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Voyage AI

2026-06-05

https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →

Multi-Model Router - Route Queries to the Cheapest Capable Model

The story

🎛 Inputs you control

📊 Outputs computed for you

About this calculator: Multi-Model Router - Route Queries to the Cheapest Capable Model

Inputs you control

Outputs computed for you · model: `router`

What you're looking at

Ready to run the numbers?

Reading your result

Models for routing tiers (cheap/mid/premium)

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

The story

🎛 Inputs you control

📊 Outputs computed for you

About this calculator: Multi-Model Router - Route Queries to the Cheapest Capable Model

Inputs you control

Outputs computed for you · model: router

What you're looking at

Ready to run the numbers?

Reading your result

Models for routing tiers (cheap/mid/premium)

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Outputs computed for you · model: `router`