AI API Cost Calculator

What does your AI feature actually cost?

Pick a model. Set your workload. See daily, monthly, and annual cost - with the real optimizations most teams miss.

Pricing verified: 2026-06-05 | 161 models across 8 providers | Caching + Batch API applied

What this calculator does

See exactly what an LLM workload will cost across 70+ models. Pick a model, enter your tokens per request and daily volume, and get per-request, daily, monthly, and annual cost. Caching and Batch API savings are calculated automatically.

Why use it

Stop guessing — turn "AI is expensive" into a precise monthly number you can defend to finance
Compare 70+ models side-by-side at YOUR token shape, not vendor marketing examples
Spot the 30-90% savings opportunities (prompt caching, Batch API, model swap) before you ship
Re-cost instantly when a vendor changes rates — your numbers stay current

📖 Read the full guide →

Who uses this:

Vibe Coder: High | Small Business: High | Enterprise: High

Below are the inputs you provide, the outputs you get, and ways to use the results for your AI workloads.

📥 Inputs you provide

Model: Pick from 70+ AI models
Input tokens per request: Size of your prompt
Output tokens per request: Expected response size
Requests per day: Your daily call volume
Prompt cache hit rate: How often your prompt prefix repeats
Days per month: Working days for billing math

📤 Outputs you get

Cost per request: Dollars per single API call
Monthly cost: Dollars per month at your volume
Annual cost: Linear annual projection
Input vs output cost split: Where the money goes
Optimization suggestions: How to cut the bill

🎯 Use your results to

🎯 Pick the right model

Run the same workload through 5 candidates; pick the cheapest that meets your quality bar
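The selection rule above can be sketched in a few lines. Everything here is hypothetical: the model names, monthly costs, and quality scores are placeholders, and how you measure "quality" (an eval score, a pass rate) is up to you.

```python
def pick_model(candidates, quality_bar):
    """Return the cheapest candidate whose quality meets the bar, or None."""
    qualified = [c for c in candidates if c["quality"] >= quality_bar]
    if not qualified:
        return None
    return min(qualified, key=lambda c: c["monthly_cost"])


# Hypothetical candidates: cost from the calculator, quality from your own evals.
candidates = [
    {"name": "model-a", "monthly_cost": 4050.0, "quality": 0.92},
    {"name": "model-b", "monthly_cost": 900.0, "quality": 0.81},
    {"name": "model-c", "monthly_cost": 1500.0, "quality": 0.88},
]

choice = pick_model(candidates, quality_bar=0.85)
print(choice["name"] if choice else "no model meets the bar")
```

Here `model-b` is cheapest but misses the 0.85 bar, so the sketch picks `model-c`.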

📈 Forecast your AI bill

Defensible monthly + annual numbers for your finance team

💾 Quantify savings

Estimated dollars from caching, Batch API, and model swap — before you implement

🔌 Integrate with your AI agents

MCP available for agentic workflow integration — surface live cost intelligence to your agents

👇 Now try the calculator below with your own AI workloads

📊 Calculator at a glance


🎛 CALCULATOR

Your workload

Estimate conservatively - we'll show you what caching + batch mode save below.

Quick preset: Load a typical workload, then tweak the numbers.

Model

Input tokens per request: The prompt + system message + context sent to the model. ~4 chars ≈ 1 token.
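If you don't have real token counts yet, the 4-characters-per-token rule of thumb can be applied like this. It is only a rough estimate (the function name `estimate_tokens` is hypothetical); a provider's own tokenizer will give different counts, especially for code or non-English text.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the field hint above


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


# Input tokens cover everything sent to the model, not just the user prompt.
system_message = "You are a helpful support assistant."
user_prompt = "How do I reset my password?"
context = "Account settings are under Profile > Security."

input_tokens = sum(estimate_tokens(part) for part in (system_message, user_prompt, context))
print(input_tokens)
```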

Output tokens per request: The model's reply. Usually the bigger line item, because output tokens are often priced at around 5x the input rate.

Requests per day

Prompt cache hit rate: If you reuse the same system prompt, apps typically see a 30-50% hit rate. Cached tokens are billed at 90% off.
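One way to model the caching discount is to treat the hit rate as the fraction of input tokens billed at the cached rate. This is a simplified sketch, not the calculator's formula: `input_cost_per_request` is a hypothetical name, the $3.00 rate is a placeholder, and the model ignores any cache-write surcharge a provider may apply.

```python
def input_cost_per_request(input_tokens, input_rate, cache_hit_rate, cached_discount=0.90):
    """Input cost in USD; input_rate is USD per million tokens."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached_tokens = input_tokens * cache_hit_rate
    fresh_tokens = input_tokens - cached_tokens
    cached_rate = input_rate * (1.0 - cached_discount)  # 90% off cached tokens
    return (fresh_tokens * input_rate + cached_tokens * cached_rate) / 1_000_000


# Hypothetical workload: 2,000 input tokens at $3.00 per million.
no_cache = input_cost_per_request(2_000, 3.0, cache_hit_rate=0.0)
with_cache = input_cost_per_request(2_000, 3.0, cache_hit_rate=0.4)
savings = 1 - with_cache / no_cache
print(f"{savings:.0%} lower input cost")
```

Under this model a 40% hit rate cuts input cost by about 36%, since only the cached share gets the 90% discount. Output tokens are unaffected.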

Batch API (50% off): Use if latency over 10 minutes is acceptable (async jobs, reports, nightly runs).
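The batch toggle amounts to a flat discount gated on latency tolerance. A minimal sketch, assuming the 50% discount applies to the whole bill and using the 10-minute threshold from the hint above (function names are hypothetical):

```python
BATCH_DISCOUNT = 0.50
MIN_ACCEPTABLE_LATENCY_MINUTES = 10


def batch_eligible(acceptable_latency_minutes: float) -> bool:
    """Batch mode only makes sense if the job can wait longer than the threshold."""
    return acceptable_latency_minutes > MIN_ACCEPTABLE_LATENCY_MINUTES


def monthly_cost_with_batch(standard_monthly_cost: float, acceptable_latency_minutes: float) -> float:
    if batch_eligible(acceptable_latency_minutes):
        return standard_monthly_cost * (1 - BATCH_DISCOUNT)
    return standard_monthly_cost


nightly_report = monthly_cost_with_batch(4050.0, acceptable_latency_minutes=480)
live_chat = monthly_cost_with_batch(4050.0, acceptable_latency_minutes=0.05)
print(nightly_report, live_chat)
```

In practice most workloads are mixed, so you might apply this only to the share of requests that can run asynchronously.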

Days per month

Compare all models →

📈 RESULTS

💰 Your estimated cost

Monthly cost
Per request
Per day
Input tokens/day
Output tokens/day
Input cost share
Output cost share
Annual
Monthly tokens

📋 What now?

Compare models — switch the model dropdown to see the same workload across 70+ options
Lock in savings — toggle caching and Batch mode to surface the 30-90% reductions before you ship
Set your budget — use the monthly + annual numbers as defensible inputs for finance

Need help cutting your AI bill? 💼 Talk to a CloudIntelligence advisor →

Now that you have your number…

What this means + what to do next

💡 What to consider beyond this number for full TCO

Observability + logging (prompts, outputs, latency, errors) — typically adds 5-10% to inference cost at production scale
Eval pipelines + benchmark sets — $500-$5K/mo even without continuous evaluation; budget more if quality drift matters
Human-in-the-loop review for edge cases — $4K-$12K/mo per FTE reviewer for production AI features
Retry / fallback overhead — typically 3-15% on top of base inference depending on error rate and retry logic
Vendor lock-in cost — invisible until migration day, often $50K+ in re-prompting + re-eval + downtime risk
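To see how these line items stack up, here is a rough sketch that turns the ranges above into a low and high monthly estimate. It assumes the overheads are simply additive, applies both percentages to base inference cost, and leaves out vendor lock-in because that is a one-time migration cost rather than a monthly one. The function name and structure are hypothetical.

```python
# Ranges taken from the list above: (low, high).
OBSERVABILITY_PCT = (0.05, 0.10)
RETRY_PCT = (0.03, 0.15)
EVAL_MONTHLY_USD = (500.0, 5_000.0)
REVIEWER_MONTHLY_USD = (4_000.0, 12_000.0)  # per FTE reviewer


def monthly_tco_range(base_inference_cost: float, reviewers: int = 0) -> tuple:
    """Return (low, high) monthly cost including the recurring overheads."""
    low = (
        base_inference_cost * (1 + OBSERVABILITY_PCT[0] + RETRY_PCT[0])
        + EVAL_MONTHLY_USD[0]
        + REVIEWER_MONTHLY_USD[0] * reviewers
    )
    high = (
        base_inference_cost * (1 + OBSERVABILITY_PCT[1] + RETRY_PCT[1])
        + EVAL_MONTHLY_USD[1]
        + REVIEWER_MONTHLY_USD[1] * reviewers
    )
    return low, high


# Hypothetical: $4,050/month of inference and one human reviewer.
low, high = monthly_tco_range(4050.0, reviewers=1)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

For small workloads the fixed items (evals, reviewers) dominate, so the percentage overheads matter less than whether you need human review at all.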

Rule of thumb: Multiply this number by 1.5–2.5× for production-ready TCO. Lower end (1.5×) = internal tools with high error tolerance and no compliance overhead. Higher end (2.5×) = customer-facing AI features with eval pipelines, compliance logging, and human review.
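Applied to a calculator result, the rule of thumb is a one-line range. The function name is hypothetical and the $4,050 input is a placeholder.

```python
TCO_MULTIPLIER_LOW = 1.5   # internal tools, no compliance overhead
TCO_MULTIPLIER_HIGH = 2.5  # customer-facing, with evals, logging, human review


def production_tco_range(calculator_monthly_cost: float) -> tuple:
    """Return (low, high) production-ready monthly TCO."""
    return (
        calculator_monthly_cost * TCO_MULTIPLIER_LOW,
        calculator_monthly_cost * TCO_MULTIPLIER_HIGH,
    )


tco_low, tco_high = production_tco_range(4050.0)
print(tco_low, tco_high)  # 6075.0 10125.0
```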

Quantify the hidden costs:

If your workload is multi-turn (chat, agents, tool use), costs compound per turn, and this baseline misses that → Agent Loop Cost
Quantify lock-in cost on the day you need to switch vendors → Vendor Concentration Risk
If you're adding retrieval, the embedding + vector DB + rerank costs aren't in this baseline → RAG Pipeline

$ How this fits your overall ROI

This calculator gives you the cost number. Here's how to turn that into an ROI story:

What revenue or cost savings does this AI feature drive each month?
How long until cumulative AI cost exceeds the value the feature generates?
How sensitive is your business to vendor price changes? (The last 12 months saw -50% to +25% swings across major vendors.)
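A quick way to start answering these questions is to compare the feature's monthly value with its AI cost under a few price scenarios. This sketch uses the -50% to +25% swing range quoted above; the function name, the $6,000 monthly value, and the $4,050 cost are hypothetical placeholders.

```python
def net_value_under_swings(monthly_value, monthly_ai_cost, swings=(-0.50, 0.0, 0.25)):
    """Net monthly value (value minus AI cost) for each vendor price swing."""
    return {swing: monthly_value - monthly_ai_cost * (1 + swing) for swing in swings}


# Hypothetical: the feature drives $6,000/month and costs $4,050/month today.
scenarios = net_value_under_swings(6_000.0, 4_050.0)
for swing, net in scenarios.items():
    print(f"price change {swing:+.0%}: net ${net:,.2f}/month")
```

If the net value turns negative in the +25% scenario, the feature's ROI depends on vendor pricing staying put, which is worth flagging before you commit.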

Bridge to ROI:

Convert per-request cost into per-customer or per-feature margin → Margin Calculator
Project 12 months out with growth + price-change assumptions → Annual Cost Forecaster
See cost at 10× and 100× current usage, where the discontinuities matter → Scale Projection
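The three bridges above reduce to simple arithmetic. The sketch below is not how the linked calculators work; it assumes compounding monthly volume growth, a single flat price change, and purely linear scaling, so it cannot show the discontinuities (rate limits, pricing tiers) that appear at real scale. All names and numbers are hypothetical.

```python
def per_customer_margin(revenue_per_customer, requests_per_customer, cost_per_request):
    """Gross margin per customer after AI cost, as a fraction of revenue."""
    ai_cost = requests_per_customer * cost_per_request
    return (revenue_per_customer - ai_cost) / revenue_per_customer


def forecast_monthly_costs(monthly_cost, monthly_growth, price_change=0.0, months=12):
    """Monthly costs with compounding volume growth and one flat price change."""
    return [
        monthly_cost * (1 + monthly_growth) ** m * (1 + price_change)
        for m in range(months)
    ]


def scale_projection(monthly_cost, multipliers=(1, 10, 100)):
    """Linear cost at each usage multiplier."""
    return {k: monthly_cost * k for k in multipliers}


margin = per_customer_margin(20.0, 200, 0.0135)   # $20/month plan, 200 requests
forecast = forecast_monthly_costs(1_000.0, 0.10, months=3)
scaled = scale_projection(4050.0)
print(f"margin {margin:.1%}", forecast, scaled)
```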

Doing something different?

These calculators may fit better:

For multi-turn agent loops with tool calls → Agent Loop Cost
For full RAG over a knowledge base with embeddings + retrieval → RAG Pipeline
For image / multimodal workloads where pricing differs → Vision Cost

Vendor / Model | Field | Why it's inferred
Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 | batchInput | Derived at 50% of input.
OpenAI / GPT-5 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models, so the family convention is used.
OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.1 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output.
Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%.
Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base.
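The derivation rules listed above follow one pattern: when a vendor does not publish a rate, it is filled in as a fixed fraction of the standard rate. A minimal sketch of that pattern follows. The field names `cachedInput`, `batchInput`, and `batchOutput` come from the list above; the `input` and `output` keys, the function name, and the rates are assumptions for illustration.

```python
def fill_inferred_rates(rates, cached_fraction=0.10, batch_fraction=0.50):
    """Return (filled_rates, inferred_fields) without mutating the input dict."""
    filled = dict(rates)
    inferred = []
    if "cachedInput" not in filled:
        filled["cachedInput"] = rates["input"] * cached_fraction
        inferred.append("cachedInput")
    if "batchInput" not in filled:
        filled["batchInput"] = rates["input"] * batch_fraction
        inferred.append("batchInput")
    if "batchOutput" not in filled:
        filled["batchOutput"] = rates["output"] * batch_fraction
        inferred.append("batchOutput")
    return filled, inferred


# Hypothetical published rates in USD per million tokens.
published = {"input": 3.0, "output": 15.0, "batchInput": 1.5}
filled, inferred = fill_inferred_rates(published)
print(filled, inferred)

# Families listed at 25% of input pass a different cached fraction.
filled_25, _ = fill_inferred_rates({"input": 3.0, "output": 15.0}, cached_fraction=0.25)
```

Tracking which fields were inferred is what lets a table mark them separately from published rates.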

Go deeper

The calculator's an estimate. Want the real number?

Methodology

Primary sources

Inferred values (marked with * in calculator tables)