Guides → Playground & Guide → AI Cost Calculator - A First-Principles Guide to LLM Pricing

AI Cost Calculator - A First-Principles Guide to LLM Pricing

Meet Priya Patel. Product Manager launching her first AI feature. "I'm scoping an AI feature for next quarter. What will it cost - and how do I even think about this?"

🔥 Engineering says '$X per request', finance wants a monthly forecast, and the math doesn't reconcile.

The story

Every LLM bill comes down to four numbers: input tokens, output tokens, requests per day, model choice. That's it. Once you understand those, you can estimate cost for any AI feature in any vendor.

Priya's launching a customer-facing feature - answering questions about uploaded PDFs. Engineering told her '~$0.03 per request'. Finance wants a monthly number. Her boss wants comparison across 3 vendors. Same feature, three perspectives.

All three views come from the same calculation. The trick is getting the inputs right - and that's where most teams fumble. They guess at token counts, pick a model based on brand, and miss the 50%+ price differences between balanced-tier vendors that do equivalent work.

This guide walks through the four numbers, shows you the live pricing for every major vendor, and surfaces the three places teams systematically underestimate (output tokens, prompt overhead, vendor-mix opportunity).

📊 CALCULATOR AT A GLANCE

🚀 Open the full calculator ✉️ Email [email protected]

🎛 Inputs you control

Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.

▸ Model — The specific AI model that will run your workload. Pricing varies 100× across the catalog so this is the highest-leverage choice you make.

How to choose: Frontier work (research, deep reasoning, complex code) → Claude Opus 4.8 ($5/$25, Fast Mode $10/$50 at 2.5× speed), Gemini 3.1 Pro ($2/$12), or GPT-5.5 ($5/$30). Most production traffic → Claude Sonnet 4.6 ($3/$15) or Gemini 3 Flash ($0.50/$3). High-volume cheap tasks → Claude Haiku 4.5 ($1/$5), GPT-5-nano ($0.05/$0.40), or Grok 4.3 ($1.25/$2.50). Cheapest period: DeepSeek V3.2 ($0.28/$0.42). Test 2-3 candidates on your actual prompts before committing.

▸ Input tokens per request — Total tokens sent TO the model per call: system prompt + user message + retrieved context + tool definitions + conversation history.

How to choose: Do not guess from word count. Run actual prompts through token-estimator. Typical shapes: single chat turn 300-1500, RAG with retrievals 2K-8K, long-doc analysis 8K-100K.

▸ Output tokens per request — Total tokens the model generates back. Output tokens cost 3-5× more than input tokens at every major vendor.

How to choose: Constrain output in your prompt: "respond in ≤ 200 tokens" beats hoping the model is brief. Typical: classification 10-50, chat reply 150-600, code generation 500-2000, long-form 1000-4000.

▸ Requests per day — How many model calls you make per day at peak. Multiplies every other cost.

How to choose: For production apps use real telemetry. For projections: peak users × requests per user per day, plus 30% buffer for retries.

▸ Prompt cache hit rate — Fraction (0-100%) of input tokens reused across requests. Cached reads cost 90% less. Cache writes carry a 1.25× premium at Anthropic, so caching only saves money above ~22% hit rate.

How to choose: High-hit (60-90%): agent loops with stable system prompts, RAG with stable retrieval. Medium (20-50%): templated workflows. Low (0-10%): one-off creative work. Use 0% if unsure.

▸ Days per month — Multiplier from daily to monthly cost.

How to choose: 30 for 24/7 consumer apps. 22 for business-day-only internal tools. 31 for max-month projections.

📊 Outputs computed for you

What you'll see after the calculator runs. Each card explains how to read the number.

▸ Cost per request — Total cost for ONE call: (input tokens × input rate) + (output tokens × output rate), with caching and Batch discounts applied if enabled.

How to read: Frontier models typically $0.01-$0.30 per call. Mid-tier $0.001-$0.05. Cheap models $0.0001-$0.005.

▸ Monthly cost — Daily cost × days per month. The number you defend to finance.

How to read: Compare against your AI budget ceiling. If over, the calc surfaces savings opportunities — usually 30-90% available.

▸ Annual cost — Daily cost × 365. Linear projection — does NOT factor usage growth or vendor pricing changes.

How to read: For realistic annual projections, use annual-cost-forecaster. This number is your floor.

▸ Input vs output cost split — Splits cost into input-share vs output-share. Output typically dominates because it costs 3-5× more.

How to read: Output share over 70% → over-generating. Constrain output. Input share over 70% → consider prompt caching.

▸ Optimization suggestions — Concrete actions to reduce spend: prompt caching, Batch API (50% off), cheaper model swap.

How to read: Each suggestion shows estimated dollars saved. Implementation is usually 1-3 days for 30-90% reduction.

About this calculator: AI Cost Calculator - A First-Principles Guide to LLM Pricing

Walk through the four numbers that drive every LLM bill - input tokens, output tokens, requests/day, and model choice. Live pricing across 17 vendors.

Inputs you control

Input	Impact on result	Range	Typical
Input tokens per request	What you send to the model. System prompt + user message + any RAG context. Most teams underestimate this 2-3× because they forget the system prompt and any tool/function definitions.	100 – 100K	2000
Output tokens per request	What the model generates. Output is typically 3-5× more expensive than input - model choice matters most when outputs are long. Streaming a 500-token answer is dramatically different from generating a 4,000-token report.	50 – 8K	500
Requests per day	Across all users. 1,000 daily active users sending 1 request/day = 1,000. The same 1,000 users in a chat-heavy product might send 10-20 each - that's 10K-20K. Watch the difference.	10 – 100K	1000

Outputs computed for you · model: `token`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.

What you're looking at

Each input shapes your cost. Move the slider — see the impact.

Input tokens per request 2,000

What you send to the model. System prompt + user message + any RAG context. Most teams underestimate this 2-3× because they forget the system prompt and any tool/function definitions.

Estimated: —

Output tokens per request 500

What the model generates. Output is typically 3-5× more expensive than input - model choice matters most when outputs are long. Streaming a 500-token answer is dramatically different from generating a 4,000-token report.

Estimated: —

Requests per day 1,000

Across all users. 1,000 daily active users sending 1 request/day = 1,000. The same 1,000 users in a chat-heavy product might send 10-20 each - that's 10K-20K. Watch the difference.

Estimated: —

Ready to run the numbers?

Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.

🚀 Open the full calculator →

Reading your result

Three numbers come out: per-request cost, monthly cost, and per-vendor breakdown. Each tells you a different thing.

Per-request cost is what engineering quotes. It's the unit economics - useful for comparing vendors, useless for budgeting. A $0.03 per-request cost means nothing until you multiply by request volume.

Monthly cost is what finance wants. Per-request × requests/day × 30. This is the line item on the budget. Always include a 20-30% buffer for usage spikes.

Per-vendor breakdown tells you whether you're picking the right model. If DeepSeek is 80% cheaper than Anthropic for the same task - and your users can't tell the difference - that's a real signal. Sometimes the spread is justified (accuracy, latency); sometimes it's just inertia.

What "good" looks like:

Internal tool (~100/day): $5-50/mo healthy. Probably overthinking - just pick balanced tier.
Small SaaS (~1K/day): $50-300/mo healthy. Vendor choice starts to matter.
Mid-scale (~10K/day): $500-3K/mo. Cost optimization (caching, routing, batching) starts paying real dividends.
Consumer-scale (~100K/day): $5K-50K/mo. Multi-vendor routing is mandatory; volume discounts are negotiable; engineer time on optimization pays back quickly.

Top 3 vendors right now (cheap → premium)

Verified 20 hours ago

1

GPT-5 Mini

$0.250 in · $2.00 out ·
2

Command

$1.00 in · $2.00 out ·
3

devstral-2

$0.400 in · $2.00 out ·

Three real scenarios

Same calculator, three different team sizes. Click a tab to see how the numbers shift.

$10.81 / month ≈ $129.68 / year

Internal Q&A tool, ~100 employees querying 1×/day on average. Balanced tier (Sonnet 4.6 / GPT-5.5). Should land $10-30/mo. Don't over-engineer - at this scale, time spent optimizing costs more than you save.

Healthy range: $5-50/mo

See inputs used

inputTokens: 1,500
outputTokens: 300
requestsPerDay: 100
modelTier: balanced
workingDaysPerMonth: 22

Trade-offs

Cost isn't the only dimension. Click any constraint — see how recommendations change.

What matters most to you? Click any dimension — recommendations update.

Best fit for "cost":

DeepSeek V3 $0.27/$1.10 per 1M tokens
Gemini 3 Flash $0.30/$2.50 per 1M tokens
Anthropic Haiku 4.5 $1.00/$5.00 per 1M tokens

The cheapest option saves 80-95% vs. flagship. Three honest questions before defaulting to cheapest: (1) Can your users detect the quality difference? (2) Will your team review/test outputs to catch mistakes? (3) Does your domain tolerate 10-15% lower factual accuracy? If yes / yes / yes - go cheapest. Otherwise, pay for balanced.

Cost implication: At 1K req/day, switching from Anthropic Sonnet 4.6 to DeepSeek V3 saves ~$160/mo. At 100K req/day, it saves ~$16K/mo.

Use cases

Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.

$2,767 / month ≈ $33,204 / year

Retrieval pulls 4-6 docs into context = 8K input tokens. Output stays modest (~600). 5K requests/day. Lands ~$2K/mo. RAG bills are 70%+ input - caching the system prompt + repeated docs cuts this 30-50%.

Healthy range: $1,000-3,000/mo (RAG is input-heavy)

See inputs used

inputTokens: 8,000
outputTokens: 600
requestsPerDay: 5,000
modelTier: balanced
workingDaysPerMonth: 30

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Token counts are approximations. Real prompts have hidden overhead (system prompts, function definitions, examples). Use the Token Estimator for ground truth.
Doesn't model prompt caching savings (Anthropic charges 10% for cached input - can cut RAG bills 30-50%). Use the Prompt Cache ROI calculator.
Doesn't model batch pricing (50% discount on most vendors for non-realtime workloads). Use the Batch vs Realtime calculator.
Doesn't include MLOps overhead - drift monitoring, eval pipelines, A/B testing.
Pricing is current as of last crawler run. Vendor pricing has moved 30-90% over the past 24 months - see the 3-year history.

For these, use: Token Estimator for accurate token counts. Prompt Cache ROI for caching savings. Batch vs Realtime for batch discounts.

Where to go next

Get accurate token counts for your real prompts →

Paste your actual prompt - see token count + cost across every vendor.

What happens at 10× usage? →

Project your bill at 10×, 100× current scale. Surface the cliffs.

Will this AI feature actually pay back? →

Hours saved × labor cost vs spend. CFO-defensible math in 30 seconds.

Methodology

Source: https://platform.claude.com/docs/en/about-claude/pricing
Extraction: Tier 0 deterministic parsers across 17 vendors, daily auto-fetch + LiteLLM cross-check
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

View 3-year history for →

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Anthropic

2026-06-05

https://www.anthropic.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Anthropic Docs

2026-06-05

https://platform.claude.com/docs/en/about-claude/pricing

Daily snapshot since Sep 2023 · 578 days captured

OpenAI

2026-06-05

https://openai.com/api/pricing/

Daily snapshot since Sep 2023 · 579 days captured

Google AI

2026-06-05

https://ai.google.dev/gemini-api/docs/pricing

Daily snapshot since Dec 2023 · 554 days captured

Google Vertex

2026-06-05

https://cloud.google.com/vertex-ai/generative-ai/pricing

Daily snapshot since Dec 2023 · 554 days captured

DeepSeek

2026-06-05

https://api-docs.deepseek.com/quick_start/pricing

Daily snapshot since May 2024 · 493 days captured

xAI

2026-06-05

https://x.ai/api

Daily snapshot since Nov 2024 · 411 days captured

Mistral

2026-06-05

https://mistral.ai/pricing

Daily snapshot since Dec 2023 · 552 days captured

Cohere

2026-06-05

https://cohere.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Voyage AI

2026-06-05

https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →

AI Cost Calculator - A First-Principles Guide to LLM Pricing

The story

🎛 Inputs you control

📊 Outputs computed for you

About this calculator: AI Cost Calculator - A First-Principles Guide to LLM Pricing

Inputs you control

Outputs computed for you · model: `token`

What you're looking at

Ready to run the numbers?

Reading your result

Top 3 vendors right now (cheap → premium)

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

The story

🎛 Inputs you control

📊 Outputs computed for you

About this calculator: AI Cost Calculator - A First-Principles Guide to LLM Pricing

Inputs you control

Outputs computed for you · model: token

What you're looking at

Ready to run the numbers?

Reading your result

Top 3 vendors right now (cheap → premium)

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Outputs computed for you · model: `token`