Batch vs Realtime - How Much of Your AI Bill Is Discountable?
Meet Tariq Hassan. Engineering Manager at a 50-person SaaS. "AWS sales said batch saves 50%. Sounds great - but how much of my AI workload can actually run in batch mode?"
🔥 Needs to deliver a 30% AI cost reduction this quarter.
Batch pricing is real money - and underused. OpenAI, Anthropic, Google, and Mistral all offer a ~50% discount on batch (non-realtime, async) workloads. The question isn't 'is batch cheaper' (yes, by 50%). It's 'what fraction of your workload is actually batch-eligible?'
Tariq's bill is $8K/mo. He assumes 'most of it is interactive' and dismisses batch. Reality check: classifications, summarizations, embeddings, content moderation, daily reports, weekly digests - typically 30-50% of a SaaS's AI workload doesn't need realtime response. The user doesn't see the request happen.
The math is simple but the audit takes work. Walk through every AI call type. For each: is the user blocked waiting? If no → batch-eligible. If yes → realtime. Apply 50% discount to the 'no' bucket and recompute the bill.
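The audit described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual logic: the `CallType` structure, the `audit` function, and the per-workload dollar figures are all hypothetical, chosen only so the total matches Tariq's $8K bill with 35% of spend batch-eligible.

```python
from dataclasses import dataclass


@dataclass
class CallType:
    name: str
    monthly_cost: float  # current spend at realtime prices, in dollars
    user_blocked: bool   # True if a user is waiting on the response


def audit(call_types, batch_discount=0.50):
    """Split spend into realtime and batch-eligible buckets, then recompute the bill."""
    eligible = sum(c.monthly_cost for c in call_types if not c.user_blocked)
    realtime = sum(c.monthly_cost for c in call_types if c.user_blocked)
    total = realtime + eligible
    new_bill = realtime + eligible * (1 - batch_discount)
    return {
        "eligible_fraction": eligible / total if total else 0.0,
        "old_bill": total,
        "new_bill": new_bill,
        "monthly_savings": total - new_bill,
    }


# Hypothetical breakdown of an $8,000/mo bill (illustrative figures only).
workloads = [
    CallType("chat assistant", 5200.0, True),
    CallType("ticket classification", 900.0, False),
    CallType("daily reports", 700.0, False),
    CallType("embedding refresh", 600.0, False),
    CallType("content moderation re-scan", 600.0, False),
]

result = audit(workloads)
print(f"batch-eligible: {result['eligible_fraction']:.0%}")
print(f"new bill: ${result['new_bill']:,.0f}/mo")
print(f"savings: ${result['monthly_savings']:,.0f}/mo")
```

The only judgment call per row is the `user_blocked` flag, which is why the audit takes more effort than the arithmetic.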
This guide walks through Tariq's audit, identifies common batch-eligible workloads, and shows how to migrate without breaking UX.
Read the savings number. Monthly bill × batch-eligible % × discount % = monthly savings. For Tariq: $8K × 35% × 50% = $1,400/mo savings = $16.8K/year.
Watch the migration cost. Each batch-eligible workload needs code changes (queue submission, polling for results, error handling). Budget 1-3 days of engineering per workload type. Tariq has 4 workload types → ~10 days = ~$8K eng cost. Payback: about 6 months ($8K ÷ $1.4K/mo).
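As a rough sketch, the savings and payback arithmetic for Tariq's numbers looks like this. The function names are illustrative, and the $800 day rate is an assumption back-derived from the "~10 days = ~$8K" figure in the text, not a quoted rate.

```python
def monthly_savings(monthly_bill, eligible_fraction, discount=0.50):
    """Monthly bill x batch-eligible share x discount."""
    return monthly_bill * eligible_fraction * discount


def payback_months(migration_cost, savings_per_month):
    """Months until cumulative savings cover the one-time migration cost."""
    if savings_per_month <= 0:
        return float("inf")  # never pays back
    return migration_cost / savings_per_month


savings = monthly_savings(8_000, 0.35)       # Tariq's bill and eligible share
annual = savings * 12

eng_days = 10                                # ~10 days across 4 workload types
day_rate = 800                               # assumed: ~$8K / ~10 days
migration_cost = eng_days * day_rate

payback = payback_months(migration_cost, savings)
print(f"savings: ${savings:,.0f}/mo (${annual:,.0f}/yr)")
print(f"migration: ${migration_cost:,.0f}, payback: {payback:.1f} months")
```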
The bigger savings: vendor competition. Once your workloads are batch-capable, you can shop the batch tier across vendors. DeepSeek batch is significantly cheaper than OpenAI batch for similar quality. Saves another 20-40% on the batch portion.
Don't over-batch. Some workloads look batch-eligible but aren't - anything where user retention hinges on speed (autocomplete, voice, instant feedback). Misclassifying these costs more in lost UX than the savings are worth.
Same calculator, three different team sizes.
Customer-facing chatbot. 90% interactive, 10% background (content moderation, archival summarization). Savings $250/mo vs migration cost $4-6K eng. Payback >12 months. Skip - invest in caching/routing instead.
Healthy range: Savings ~$250/mo - probably not worth migration
Tariq's bill - $8K with 35% batch-eligible (classifications, daily reports, embedding refresh, content moderation). Saves $1.4K/mo. Migration cost ~$8K. Payback 6 months. Worth it.
Healthy range: Savings $1.4K/mo - payback ~6mo
Content site doing AI summarization, tagging, translation, image alt-text - all batch-eligible. $25K bill, 70% batchable. Savings $8.75K/mo = $105K/yr. Mandatory; migrate this month.
Healthy range: Savings $8.75K/mo - major win
Cost isn't the only dimension. Here is how each constraint changes the recommendation.
Batch migration ROI is dominated by the eligibility audit. Get that right, and the math works. Get it wrong, you ship batch infra that captures 5% savings on 80% of your bill - and nobody can tell why.
Batch tier on most vendors is the SAME model, just queued. No quality compromise. The 50% discount is for accepting latency (typically minutes-to-hours).
Batch tier doesn't downgrade compliance. Anthropic batch with BAA = realtime with BAA. Verify in contract though.
Same as compliance - batch keeps your privacy tier. Don't sacrifice no-train to save money.
The latency difference is the whole point. If user is waiting, realtime. If user isn't, batch. Misclassifying breaks UX.
Batch APIs across vendors are similar (submit job, poll for results). Architecturally portable. Easier to multi-vendor than realtime APIs.
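The submit-then-poll shape mentioned above can be sketched against a vendor-neutral interface. Everything here is hypothetical: `BatchClient`, its three methods, the status strings, and `FakeClient` are illustrative stand-ins, not any real SDK's API. A real integration would wrap each vendor's own batch endpoints behind an adapter with this shape.

```python
import time
from typing import Protocol


class BatchClient(Protocol):
    """Hypothetical vendor-neutral interface; not a real SDK."""

    def submit(self, requests: list[dict]) -> str: ...
    def status(self, job_id: str) -> str: ...  # "pending", "done", or "failed"
    def results(self, job_id: str) -> list[dict]: ...


def run_batch(client, requests, poll_seconds=60.0, timeout_seconds=86_400.0,
              sleep=time.sleep):
    """Submit a job, poll until it finishes, and return its results."""
    job_id = client.submit(requests)
    waited = 0.0
    while True:
        state = client.status(job_id)
        if state == "done":
            return client.results(job_id)
        if state == "failed":
            raise RuntimeError(f"batch job {job_id} failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"batch job {job_id} still pending after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


class FakeClient:
    """In-memory stand-in used only to exercise the polling loop."""

    def __init__(self, polls_until_done=2):
        self._remaining = polls_until_done
        self._requests = []

    def submit(self, requests):
        self._requests = list(requests)
        return "job-1"

    def status(self, job_id):
        if self._remaining > 0:
            self._remaining -= 1
            return "pending"
        return "done"

    def results(self, job_id):
        return [{"id": r["id"], "output": f"processed:{r['input']}"}
                for r in self._requests]


outputs = run_batch(FakeClient(), [{"id": "a", "input": "hello"}],
                    poll_seconds=0.0, sleep=lambda s: None)
print(outputs)
```

Keeping the polling loop separate from the client is what makes the multi-vendor case cheap: only the adapter changes per vendor.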
Batch adds operational surface - queue, polling, error handling. Manageable, but real. Budget the eng time + observability investment.
Scenarios for the most common applications, with realistic numbers.
Re-embed new docs nightly. Pure batch - nobody waits for it. Vendor choice: cheapest batch tier wins. OpenAI text-embedding-3 batch is hard to beat.
Healthy range: 100% batchable - cuts cost in half
User-generated content classifiers. New posts: realtime (block harmful before display). Existing content re-scan: batch. 80% of volume is the re-scan. Migrate the re-scan; keep new-post stream realtime.
Healthy range: 80% batchable (delayed-OK content)
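A minimal sketch of the moderation split, assuming each job carries a flag saying whether the post is new (not yet displayed) or part of the existing-content re-scan. The `ModerationJob` type and the `route` and `split` helpers are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModerationJob:
    content_id: str
    is_new_post: bool  # True when the post has not been displayed yet


def route(job):
    """New posts must be checked before display; re-scans can wait."""
    return "realtime" if job.is_new_post else "batch"


def split(jobs):
    """Partition jobs into the realtime stream and the batch queue."""
    realtime = [j for j in jobs if route(j) == "realtime"]
    batch = [j for j in jobs if route(j) == "batch"]
    return realtime, batch


# Illustrative mix matching the 80/20 split in the scenario above.
jobs = [ModerationJob(f"new-{i}", True) for i in range(2)]
jobs += [ModerationJob(f"old-{i}", False) for i in range(8)]

realtime, batch = split(jobs)
batch_share = len(batch) / len(jobs)
print(f"realtime: {len(realtime)}, batch: {len(batch)} ({batch_share:.0%})")
```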
AI-generated weekly digests, sales summaries, support trend reports. Run overnight Sunday → ready Monday morning. Zero user-facing latency. Cuts bill 50%.
Healthy range: 100% batchable - set and forget
AI summarizes closed tickets for analytics. Closed tickets aren't urgent - batch them nightly. Live agent assist (40%) stays realtime.
Healthy range: 60% batchable (post-conversation summaries)
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Cost Calculator for per-workload pricing. Multi-Model Router for routing layer. Prompt Cache ROI for additional optimization.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
The last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
The fields listed here are inferred: derived from industry conventions rather than published directly as a rate by the vendor. Typical conventions: cached input = 10% of base (90% off); Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic - Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic - Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic - Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount. |
| Anthropic - Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount. |
| Anthropic - Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention. |
| OpenAI - GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI - GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI - GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI - GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI - GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI - GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models, so the family convention is used. |
| OpenAI - GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention. |
| OpenAI - GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI - GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google - Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%. |
| Google - Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google - Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google - Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google - Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google - Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google - Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google - Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google - Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google - Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google - Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| xAI - Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
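The derivation rules in the table reduce to two multipliers. The sketch below is an assumption about how such fields could be computed, not the site's actual code. The field names come from the table, while `infer_fields` and the $2.00 / $8.00 per-million-token rates are hypothetical.

```python
def infer_fields(input_rate, output_rate, cached_fraction=0.10, batch_fraction=0.50):
    """Derive cached and batch rates from published base rates.

    Defaults follow the typical conventions: cached input at 10% of base,
    batch at 50% of base. Pass cached_fraction=0.25 for the rows that use
    a 25% cached rate.
    """
    return {
        "cachedInput": input_rate * cached_fraction,
        "batchInput": input_rate * batch_fraction,
        "batchOutput": output_rate * batch_fraction,
    }


# Hypothetical base rates in dollars per million tokens (not real prices).
inferred = infer_fields(2.00, 8.00)
legacy = infer_fields(2.00, 8.00, cached_fraction=0.25)

print(inferred)
print(legacy["cachedInput"])
```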
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for a full audit trail.
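One way to picture the snapshot-to-changelog step is a diff between two daily snapshots. This is a sketch under assumptions: the real schema of aicost_pricing_snapshots and aicost_price_changelog is not described here, so the dictionary layout, the `diff_snapshots` helper, and the example prices are all hypothetical.

```python
def diff_snapshots(old, new):
    """Return one changelog entry per (model, field) whose price changed."""
    entries = []
    for key in sorted(set(old) & set(new)):
        if old[key] != new[key]:
            model, field = key
            entries.append(
                {"model": model, "field": field, "old": old[key], "new": new[key]}
            )
    return entries


# Hypothetical snapshots keyed by (model, field); prices are made up.
yesterday = {
    ("example-model", "input"): 3.00,
    ("example-model", "output"): 12.00,
    ("example-model", "batchInput"): 1.50,
}
today = {
    ("example-model", "input"): 2.50,
    ("example-model", "output"): 12.00,
    ("example-model", "batchInput"): 1.25,
}

changes = diff_snapshots(yesterday, today)
for entry in changes:
    print(entry)
```

Recording both the old and new value in each entry is what lets a later reader reconstruct any past price without replaying every snapshot.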