Guides → Playground & Guide → Agentic Workflow Cost - A Guide for Engineering Leaders
Meet Sarah Chen. VP Engineering at a 50-person SaaS. "I'm rolling out Claude Code to all 5 senior devs. Will this kill my cloud budget?"
🔥 CFO is asking for a 12-month forecast by Friday.
Sarah's team adopted Claude Code 30 days ago. The first month bill: $9,400. Her CFO wants a forecast - month-2 projection, full-year, and a comparison vs. not doing this.
Looking at her actuals: 5 devs averaged ~80 tasks per day, ~100K tokens per task, on Sonnet 4.6. About half the input was cached system prompts and repeated context. She wasn't using batch processing - agents need realtime.
Here's the question every engineering leader is facing right now: at what scale does agentic AI become a budget problem? The answer depends on three numbers most teams aren't measuring: tasks per day, tokens per task, and cache hit rate. This guide walks you through Sarah's real numbers, then lets you plug in yours.
Estimate monthly burn for coding agents and autonomous workflows across 4 vendors. Walks through Sarah's 5-dev team scenario with live pricing and 3-year history.
Below: live sliders. Move them to see numbers change in real time.
Each input shapes your cost. Move the slider — see the impact.
Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.
🚀 Open the full calculator →Your monthly number is just the inference bill. Three things to check before you take it to your CFO:
Per-developer normalization. Divide monthly cost by team size. Healthy code-agent spend: $200-800/dev/month. Above $1500/dev/month: investigate. You're either using a too-premium model, your token estimates are off, or your team is using the agent for things it shouldn't be doing.
Vendor spread. The per-vendor breakdown tells you whether you're picking the right model. If DeepSeek is 90% cheaper at the same quality benchmarks for your use case, and your CFO finds out, that's an awkward conversation. Sometimes the spread is real (latency, accuracy needs); sometimes it's just inertia.
Runaway buffer. Your number assumes a 1.2× safety multiplier. In practice, agent loops occasionally do 3-5× their expected work. Budget the buffer; track actuals weekly.
Currently: Anthropic Sonnet 4.6 $105.42 / mo
Switching vendors is a 3-6 month decision. We help model the full switching cost, not just the inference price. Get a vendor migration analysis →
Same calculator, three different team sizes. Click a tab to see how the numbers shift.
One developer, balanced tier, 60% cache hit (lots of repeated system prompts). Expect ~$120/mo at this profile.
Healthy range: $50-$200/mo
Sarah's actual config. Coming in around $342/mo on Sonnet 4.6. Her actual was $9,400/mo - meaning her real tokensPerTask is closer to 800K, or modelTier is premium, or both. The calc surfaces what to investigate.
Healthy range: $300-$800/mo
Premium tier (Opus) for complex codebases at scale. ~$8K/mo realistic - should be using volume discounts (Anthropic Enterprise tier negotiates 20-40% off list).
Healthy range: $4K-$12K/mo
Cost isn't the only dimension. Click any constraint — see how recommendations change.
Cheapest models can save 80-90% vs flagship. But they hallucinate more, have weaker reasoning, and can fail silently on edge cases. Use them for tasks where you can verify output cheaply (humans review, structured outputs, code that gets tested).
Significantly higher hallucination risk. Not for production-critical workloads without strong verification.
For code generation, hallucination = wrong code = reviewer time. At a 5% hallucination rate vs 2%, you're adding ~1hr/dev/week of review work. At $100/hr loaded cost × 5 devs × 4 weeks = $2,000/mo - exactly erasing the model savings. Math the hidden cost.
SOC 2 Type II, HIPAA BAA available, GDPR-aligned, EU AI Act ready.
Strong fit for EU regulated industries.
Verify per-product. Not yet ready for HIPAA, FedRAMP, or strict EU requirements.
Compliance isn't just checkboxes - it's audit-trail obligations, data residency, training opt-out. Major vendors have it. Smaller open-source providers - verify in your contract, not just on the marketing page.
Anthropic, OpenAI, Google all default to no-train on Enterprise tier.
Highest privacy posture - but you operate the infrastructure.
Privacy posture varies by tier within the same vendor. ChatGPT consumer plan: trains on data. ChatGPT Team/Enterprise: doesn't. Always verify your tier - and write the no-train clause into your contract, not just rely on the privacy page.
Smaller models = faster. For code agents, latency matters less (devs are reading output). For voice agents, you need sub-300ms TTFT or interruptions feel awkward. Most coding agents are fine with balanced tier - the model thinks for 1-3s anyway.
Use a router (LiteLLM, our Multi-Model Router calc) to swap vendors per task type.
Plan a 3-6 month switching cost if pricing changes 50%+. Keep prompt portability in mind.
Single-vendor stacks have a 3-6 month switching cost. The first month: identify the new vendor and run pilots. Months 2-3: rewrite prompts (each vendor has tuning quirks). Months 4-6: A/B test in production. Plan it; pricing in this market changes 30-50% per year.
Vendor handles drift. Pay the convenience tax.
Drift monitoring, eval pipelines, retraining cycles. Budget for it.
GPU rental, version management, security patches, model upgrades. Significant operational burden.
MLOps is the silent cost. Self-hosting saves on inference but adds drift monitoring, eval pipelines, GPU management, and security patches. For most teams under 100 ML engineers, API + multi-vendor routing is the right answer.
Tradeoff analysis is where most AI projects go sideways. Talk to a CFO-grade AI cost analyst →
Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.
Customer support automation - high volume, small tasks, lots of cached system prompt. Per-ticket cost should land $0.005-$0.02. Above $0.05/ticket: you're using too-premium a model.
Healthy range: $100-$500/mo · per-ticket: $0.005-$0.02
Premium tier justified - code review demands accuracy. Higher input share (85%) because most tokens are the codebase being reviewed. 40% cache (some shared system prompt + repeated style guides).
Healthy range: $400-$1,500/mo
Output-heavy (40% input, 60% output) - drafting consumes few tokens, produces many. 50% batch-eligible (overnight social copy). Cheap tier (Haiku/Flash) is fine - humans review.
Healthy range: <$200/mo
Long contracts as input (90% input share). Premium tier mandatory - accuracy + reasoning. Realtime (no batch) because lawyers iterate. Larger runaway buffer (1.3×) because legal queries trigger deeper analysis chains.
Healthy range: $800-$2,000/mo
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: ROI Quick Check models reviewer time. Full TCO Wizard includes MLOps and learning curve. Token Estimator measures your real prompts.
Hours saved × loaded cost vs your AI spend. The math your CFO actually wants.
Stop paying flagship for easy queries →Route low-difficulty tasks to cheap models, hard ones to premium. Typical savings: 40-60%.
Full TCO with HITL, drift, compliance →7-step wizard. The deliverable you hand to your CFO.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
View 3-year history for →
Last-verified date is the most recent successful daily snapshot
(aicost_pricing_snapshots) or, when no snapshot exists yet,
the latest successful crawler run (aicost_crawler_runs).
10 of 10
vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.)
are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput |
Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput |
Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput |
Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput |
Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput |
Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput |
Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput |
Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput |
Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput |
Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput |
Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput |
Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput |
Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput |
Extrapolated at 25% of base. |
Pricing is cross-verified against the
LiteLLM community registry
when available. Daily snapshots are kept in aicost_pricing_snapshots;
every change is logged to aicost_price_changelog with old & new
values for full audit trail. Read the full methodology →