Guides → Playground & Guide → Fine-Tuning Cost - Training + Inference Break-Even
Meet Faisal Ahmad. ML Engineer at a 100-person legal tech company. "We have 50K legal contracts as training data. Fine-tuning quote was $4K. Will it actually save money?"
🔥 Decision needed by sprint planning Monday.
Fine-tuning math is two-sided: training cost upfront, inference savings ongoing. Training cost depends on base model + dataset size. Inference savings come from using smaller fine-tuned models in place of frontier ones at lower per-token rates. Break-even is the volume × time at which inference savings overcome training cost.
Faisal's case: 50K contracts × ~5K tokens each = 250M training tokens. OpenAI fine-tuning: ~$8/1M training tokens × 250M = $2,000 + $4 reserved for the inference layer. Anthropic similar. Plus eval time + retraining iterations. Realistic total: $4-8K all-in for the first deployment.
Three break-even modes. (1) Volume-driven: high query volume against a fine-tuned smaller model beats lower-volume frontier. (2) Quality-driven: domain-specific FT outperforms frontier on narrow tasks (legal, medical, code). (3) Latency-driven: smaller FT model is faster than frontier - wins for voice/realtime.
Fine-tuning math: training compute + tokens + base model selection + inference savings. When custom models pay back - and when they don't.
fine_tuning
Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.
Each input shapes your cost. Move the slider — see the impact.
Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.
🚀 Open the full calculator →Training cost is upfront and fixed. $2-15K typical for production FT. Re-train cost = same as initial when you upgrade base model or substantially update training data.
Monthly savings = (frontier cost - FT cost) × queries/month. At Faisal's scale: ~$650/mo savings. Training pays back in ~6 months.
Re-training timeline matters. If you'll re-train in 9 months (vendor releases new base model), you only have 9 months to amortize. Tighter break-even threshold.
Quality lift is the bigger story usually. FT often improves task-specific accuracy 10-30% vs frontier with prompting. That's worth more than the inference savings in most production workflows.
Same calculator, three different team sizes. Click a tab to see how the numbers shift.
500 queries/day. Frontier cost: $90/mo. FT savings: $72/mo. Training $3K. Payback: 42 months - longer than the model lifetime. Don't fine-tune.
Healthy range: Training never amortizes
Faisal's legal contract analyzer. 5K queries/day on FT model saves ~$650/mo vs frontier. Training $4K. Payback ~6 months. Plus 20%+ accuracy lift on legal domain. Strong yes.
Healthy range: Payback ~6 months, plus quality lift
100K queries/day. Frontier cost: $35K/mo. FT savings: ~$30K/mo. Training $12K. Payback: 2 weeks. At this scale, FT is mandatory - engineering effort is the constraint, not cost.
Healthy range: Saves $30K+/mo, payback <2 weeks
Cost isn't the only dimension. Click any constraint — see how recommendations change.
Hosted FT (OpenAI, Anthropic, Google) is operationally easy. Self-hosted FT is dramatically cheaper at scale - if you have ML/SRE capacity. Default to hosted; switch when scale demands.
Fine-tuned models hallucinate WITHIN their domain - confidently wrong on edge cases the training set didn't cover. Eval harness is non-negotiable for production FT.
Where did training data come from? Is PII scrubbed? Can you reproduce the dataset? Compliance teams ask all of these. Document the answers.
Hosted FT default tier sometimes uses your training data for vendor improvements. Verify enterprise no-train language.
Latency win is often as valuable as cost win. Voice agents, autocomplete, search - sub-300ms requirements are met by small FT, not by frontier.
FT on Vendor X ≠ usable on Vendor Y. Switching is full retraining. Plan for it. Self-hosted FT (Llama, Mistral) is more portable but adds ops burden.
FT MLOps is the silent cost. Eval pipelines, drift monitoring, retraining cycles, A/B tests, version management. Budget 0.5-1 ML engineer per fine-tuned model in production.
Tradeoff analysis is where most AI projects go sideways. Talk to a CFO-grade AI cost analyst →
Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.
500K completions/day. Cheap-tier FT model serves the easy cases (~80%). Frontier handles complex (~20%). Saves $100K+/mo on the easy-case volume. Plus latency wins for autocomplete UX.
Healthy range: Mandatory at this scale
ICD-10 coding, legal citation matching. Domain rules are stable. FT encodes conventions reliably. Inference savings ($800/mo) are real but secondary - the 25% accuracy lift over prompted frontier is the win.
Healthy range: Quality lift dominates ROI
Sub-300ms classifier for voice agent intent routing. Tiny FT model (Mistral 7B fine-tuned). 15× cheaper than frontier per query. Plus 200ms latency advantage. Voice UX requires this.
Healthy range: Latency + cost double win
30K tickets/day, 5-class categorization. Cheap FT model accurate enough. Saves ~$1.8K/mo vs frontier. Payback 2 months. Plus narrow output schema means deterministic structure.
Healthy range: Payback ~2 months
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: RAG vs Fine-Tuning for the strategic decision. Self-Host Break-even for FT-on-own-GPUs.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
View 3-year history for →
Last-verified date is the most recent successful daily snapshot
(aicost_pricing_snapshots) or, when no snapshot exists yet,
the latest successful crawler run (aicost_crawler_runs).
10 of 10
vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.)
are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput |
Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput |
Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput |
Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput |
Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput |
Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput |
Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput |
Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput |
Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput |
Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput |
Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput |
Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput |
Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput |
Extrapolated at 25% of base. |
Pricing is cross-verified against the
LiteLLM community registry
when available. Daily snapshots are kept in aicost_pricing_snapshots;
every change is logged to aicost_price_changelog with old & new
values for full audit trail. Read the full methodology →