Guides → Playground & Guide → AI Cost Calculator - A First-Principles Guide to LLM Pricing
Meet Priya Patel. Product Manager launching her first AI feature. "I'm scoping an AI feature for next quarter. What will it cost - and how do I even think about this?"
🔥 Engineering says '$X per request', finance wants a monthly forecast, and the math doesn't reconcile.
Every LLM bill comes down to four numbers: input tokens, output tokens, requests per day, model choice. That's it. Once you understand those, you can estimate cost for any AI feature in any vendor.
Priya's launching a customer-facing feature - answering questions about uploaded PDFs. Engineering told her '~$0.03 per request'. Finance wants a monthly number. Her boss wants comparison across 3 vendors. Same feature, three perspectives.
All three views come from the same calculation. The trick is getting the inputs right - and that's where most teams fumble. They guess at token counts, pick a model based on brand, and miss the 50%+ price differences between balanced-tier vendors that do equivalent work.
This guide walks through the four numbers, shows you the live pricing for every major vendor, and surfaces the three places teams systematically underestimate (output tokens, prompt overhead, vendor-mix opportunity).
Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.
What you'll see after the calculator runs. Each card explains how to read the number.
Walk through the four numbers that drive every LLM bill - input tokens, output tokens, requests/day, and model choice. Live pricing across 17 vendors.
token
Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.
Each input shapes your cost. Move the slider — see the impact.
Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.
🚀 Open the full calculator →Three numbers come out: per-request cost, monthly cost, and per-vendor breakdown. Each tells you a different thing.
Per-request cost is what engineering quotes. It's the unit economics - useful for comparing vendors, useless for budgeting. A $0.03 per-request cost means nothing until you multiply by request volume.
Monthly cost is what finance wants. Per-request × requests/day × 30. This is the line item on the budget. Always include a 20-30% buffer for usage spikes.
Per-vendor breakdown tells you whether you're picking the right model. If DeepSeek is 80% cheaper than Anthropic for the same task - and your users can't tell the difference - that's a real signal. Sometimes the spread is justified (accuracy, latency); sometimes it's just inertia.
Same calculator, three different team sizes. Click a tab to see how the numbers shift.
Internal Q&A tool, ~100 employees querying 1×/day on average. Balanced tier (Sonnet 4.6 / GPT-5.5). Should land $10-30/mo. Don't over-engineer - at this scale, time spent optimizing costs more than you save.
Healthy range: $5-50/mo
Customer-facing feature, 1,000 daily active users sending 1 request/day. Balanced tier. Lands ~$200/mo. Vendor choice now starts to matter - DeepSeek/Gemini Flash could cut this 60-80%.
Healthy range: $50-300/mo
10K requests/day with longer outputs. ~$2K/mo on Sonnet. At this scale, prompt caching (Anthropic charges 10% for cached) and multi-model routing typically save 40-60%.
Healthy range: $500-3,000/mo
Cost isn't the only dimension. Click any constraint — see how recommendations change.
The cheapest option saves 80-95% vs. flagship. Three honest questions before defaulting to cheapest: (1) Can your users detect the quality difference? (2) Will your team review/test outputs to catch mistakes? (3) Does your domain tolerate 10-15% lower factual accuracy? If yes / yes / yes - go cheapest. Otherwise, pay for balanced.
12-15 percentage points lower. For customer-facing features, this is the difference between 'helpful tool' and 'embarrassing PR incident'.
Hallucination cost shows up as: customer support tickets, content corrections, eroded trust. None of those appear in your inference bill - but they're real. For consumer-facing features, the balanced tier (Sonnet 4.6 / GPT-5.5) is usually the right call.
Don't assume - read the contract.
Compliance turns into deal-breakers fast. If you're selling into regulated industries (healthcare, finance, education), pick a vendor with the right certifications BEFORE building. Migrating later costs 2-3 months.
Highest privacy posture, but you operate the infrastructure.
Privacy posture varies by tier within the same vendor. Default API tier on most vendors does NOT use your data for training - but verify in your contract. Free tiers usually do.
For chat, latency to first token matters more than total time. Streaming + cached system prompts make a 'feels-fast' UX even with balanced-tier models. For voice agents, sub-300ms TTFT is mandatory - see the <a href='/tools/voice-agent-stack'>Voice Agent Stack</a> guide.
Pricing can change 30-50% per year. Plan for it.
The single-vendor honeymoon is real for the first 6 months. Then a price hike or capability gap forces a swap, and you discover that prompt portability is harder than expected. Build with portability in mind from day 1.
MLOps is the silent cost. For most teams under 100 ML engineers, API + multi-vendor routing is the right call. Self-host only when privacy or latency makes APIs infeasible.
Tradeoff analysis is where most AI projects go sideways. Talk to a CFO-grade AI cost analyst →
Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.
Retrieval pulls 4-6 docs into context = 8K input tokens. Output stays modest (~600). 5K requests/day. Lands ~$2K/mo. RAG bills are 70%+ input - caching the system prompt + repeated docs cuts this 30-50%.
Healthy range: $1,000-3,000/mo (RAG is input-heavy)
Marketing copy generation. Short prompts (~1K input), long outputs (~4K). 200 generations/day, premium tier (quality matters). Lands ~$600/mo. Output-heavy - premium tier hurts here; consider balanced tier with human review.
Healthy range: $300-1,000/mo (output-dominated)
30-min meeting transcribed (~6K tokens input), summarized to ~800 tokens. 500 meetings/day across team. ~$400/mo for the summary step. (Transcription is separate - see the <a href='/tools/audio-cost'>Audio Cost</a> guide.)
Healthy range: $200-700/mo (excludes transcription)
Embeddings are 10-30× cheaper than chat. Text-embedding-3 from OpenAI is $0.02/1M tokens. Indexing 50K docs of 1,500 tokens each: $1.50 one-time. Ongoing for new docs: pennies. See the <a href='/tools/embedding-cost'>Embedding Cost</a> guide for the full RAG-pipeline picture.
Healthy range: $50-200 one-time + $5-30/mo ongoing
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Token Estimator for accurate token counts. Prompt Cache ROI for caching savings. Batch vs Realtime for batch discounts.
Paste your actual prompt - see token count + cost across every vendor.
What happens at 10× usage? →Project your bill at 10×, 100× current scale. Surface the cliffs.
Will this AI feature actually pay back? →Hours saved × labor cost vs spend. CFO-defensible math in 30 seconds.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
View 3-year history for →
Last-verified date is the most recent successful daily snapshot
(aicost_pricing_snapshots) or, when no snapshot exists yet,
the latest successful crawler run (aicost_crawler_runs).
10 of 10
vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.)
are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput |
Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput |
Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput |
Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput |
Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput |
Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput |
Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput |
Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput |
Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput |
Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput |
Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput |
Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput |
Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput |
Extrapolated at 25% of base. |
Pricing is cross-verified against the
LiteLLM community registry
when available. Daily snapshots are kept in aicost_pricing_snapshots;
every change is logged to aicost_price_changelog with old & new
values for full audit trail. Read the full methodology →