Guides → Playground & Guide → Fine-Tuning Cost - Training + Inference Break-Even

Fine-Tuning Cost - Training + Inference Break-Even

Meet Faisal Ahmad. ML Engineer at a 100-person legal tech company. "We have 50K legal contracts as training data. Fine-tuning quote was $4K. Will it actually save money?"

🔥 Decision needed by sprint planning Monday.

The story

Fine-tuning math is two-sided: training cost upfront, inference savings ongoing. Training cost depends on base model + dataset size. Inference savings come from using smaller fine-tuned models in place of frontier ones at lower per-token rates. Break-even is the volume × time at which inference savings overcome training cost.

Faisal's case: 50K contracts × ~5K tokens each = 250M training tokens. OpenAI fine-tuning: ~$8/1M training tokens × 250M = $2,000 + $4 reserved for the inference layer. Anthropic similar. Plus eval time + retraining iterations. Realistic total: $4-8K all-in for the first deployment.

Three break-even modes. (1) Volume-driven: high query volume against a fine-tuned smaller model beats lower-volume frontier. (2) Quality-driven: domain-specific FT outperforms frontier on narrow tasks (legal, medical, code). (3) Latency-driven: smaller FT model is faster than frontier - wins for voice/realtime.

📊 CALCULATOR AT A GLANCE

🚀 Open the full calculator ✉️ Email [email protected]

About this calculator: Fine-Tuning Cost - Training + Inference Break-Even

Fine-tuning math: training compute + tokens + base model selection + inference savings. When custom models pay back - and when they don't.

Inputs you control

Input	Impact on result	Range	Typical
Training corpus tokens (millions)	Total tokens in your training data. 50K docs × 5K tokens = 250M. Most production FT uses 100M-1B.	1 – 10K	250
Queries per day (FT model serves)	Daily queries that would route to the FT model. Higher volume = faster amortization.	100 – 10M	5000
Frontier-to-FT cost ratio per query	How much cheaper FT model inference is vs frontier. GPT-4-class to GPT-3.5-FT: ~10×. Sonnet to Haiku-FT: ~6×. Lower ratio = harder to justify training.	1.5 – 30	6

Outputs computed for you · model: `fine_tuning`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.

What you're looking at

Each input shapes your cost. Move the slider — see the impact.

Training corpus tokens (millions) 250

Total tokens in your training data. 50K docs × 5K tokens = 250M. Most production FT uses 100M-1B.

Estimated: —

Queries per day (FT model serves) 5,000

Daily queries that would route to the FT model. Higher volume = faster amortization.

Estimated: —

Frontier-to-FT cost ratio per query 6

How much cheaper FT model inference is vs frontier. GPT-4-class to GPT-3.5-FT: ~10×. Sonnet to Haiku-FT: ~6×. Lower ratio = harder to justify training.

Estimated: —

Ready to run the numbers?

Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.

🚀 Open the full calculator →

Reading your result

Training cost is upfront and fixed. $2-15K typical for production FT. Re-train cost = same as initial when you upgrade base model or substantially update training data.

Monthly savings = (frontier cost - FT cost) × queries/month. At Faisal's scale: ~$650/mo savings. Training pays back in ~6 months.

Re-training timeline matters. If you'll re-train in 9 months (vendor releases new base model), you only have 9 months to amortize. Tighter break-even threshold.

Quality lift is the bigger story usually. FT often improves task-specific accuracy 10-30% vs frontier with prompting. That's worth more than the inference savings in most production workflows.

What "good" looks like:

Strong FT fit: >5K queries/day + narrow domain + stable training data + $300+/mo current spend
Marginal: 1-5K queries/day, modest domain specificity
Skip FT: <1K queries/day OR rapidly changing domain (use RAG)
Quality-driven (volume secondary): Narrow tasks where frontier hits ceiling - legal, medical coding, format-rigid output

Fine-tuning capable vendors right now

Verified 20 hours ago

1

GPT-5 Mini

$0.250 in · $2.00 out ·
2

Command

$1.00 in · $2.00 out ·
3

devstral-2

$0.400 in · $2.00 out ·

Three real scenarios

Same calculator, three different team sizes. Click a tab to see how the numbers shift.

$269.65 / month ≈ $3,236 / year

500 queries/day. Frontier cost: $90/mo. FT savings: $72/mo. Training $3K. Payback: 42 months - longer than the model lifetime. Don't fine-tune.

Healthy range: Training never amortizes

See inputs used

trainingTokens: 100
queriesPerDay: 500
frontierVsFtCostRatio: 5
trainingCostUsd: 3,000
inputTokensPerQuery: 2,000
outputTokensPerQuery: 400

Trade-offs

Cost isn't the only dimension. Click any constraint — see how recommendations change.

What matters most to you? Click any dimension — recommendations update.

Best fit for "cost":

OpenAI GPT-4o-mini FT (legacy) $3/1M training, $0.30/1M inference
Anthropic Haiku FT (when GA) Similar economics
Self-hosted Llama / Mistral FT GPU rental + ops, lowest unit cost at scale

Hosted FT (OpenAI, Anthropic, Google) is operationally easy. Self-hosted FT is dramatically cheaper at scale - if you have ML/SRE capacity. Default to hosted; switch when scale demands.

Use cases

Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.

$4,482 / month ≈ $53,785 / year

500K completions/day. Cheap-tier FT model serves the easy cases (~80%). Frontier handles complex (~20%). Saves $100K+/mo on the easy-case volume. Plus latency wins for autocomplete UX.

Healthy range: Mandatory at this scale

See inputs used

trainingTokens: 1,000
queriesPerDay: 500,000
frontierVsFtCostRatio: 10
trainingCostUsd: 15,000
inputTokensPerQuery: 800
outputTokensPerQuery: 100

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Training cost varies dramatically by base model (Mini vs full GPT-4-class can be 10×).
Doesn't include eval iteration cost (typically 30-50% on top of training).
Doesn't model the engineering time to prepare training data + eval harness.
Quality lift is workload-specific - measure on your task before committing.

For these, use: RAG vs Fine-Tuning for the strategic decision. Self-Host Break-even for FT-on-own-GPUs.

Where to go next

Should you RAG instead? →

When RAG is the better strategic call.

Self-hosted FT break-even →

GPU rental + FT ops vs hosted FT.

FT vendor lock-in exposure →

Multi-vendor strategy with FT.

Methodology

Source: https://platform.openai.com/docs/guides/fine-tuning
Extraction: Per-vendor FT pricing extracted weekly. Quality benchmarks from published case studies.
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

View 3-year history for →

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Anthropic

2026-06-05

https://www.anthropic.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Anthropic Docs

2026-06-05

https://platform.claude.com/docs/en/about-claude/pricing

Daily snapshot since Sep 2023 · 578 days captured

OpenAI

2026-06-05

https://openai.com/api/pricing/

Daily snapshot since Sep 2023 · 579 days captured

Google AI

2026-06-05

https://ai.google.dev/gemini-api/docs/pricing

Daily snapshot since Dec 2023 · 554 days captured

Google Vertex

2026-06-05

https://cloud.google.com/vertex-ai/generative-ai/pricing

Daily snapshot since Dec 2023 · 554 days captured

DeepSeek

2026-06-05

https://api-docs.deepseek.com/quick_start/pricing

Daily snapshot since May 2024 · 493 days captured

xAI

2026-06-05

https://x.ai/api

Daily snapshot since Nov 2024 · 411 days captured

Mistral

2026-06-05

https://mistral.ai/pricing

Daily snapshot since Dec 2023 · 552 days captured

Cohere

2026-06-05

https://cohere.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Voyage AI

2026-06-05

https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →

Fine-Tuning Cost - Training + Inference Break-Even

The story

About this calculator: Fine-Tuning Cost - Training + Inference Break-Even

Inputs you control

Outputs computed for you · model: `fine_tuning`

What you're looking at

Ready to run the numbers?

Reading your result

Fine-tuning capable vendors right now

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

The story

About this calculator: Fine-Tuning Cost - Training + Inference Break-Even

Inputs you control

Outputs computed for you · model: fine_tuning

What you're looking at

Ready to run the numbers?

Reading your result

Fine-tuning capable vendors right now

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Outputs computed for you · model: `fine_tuning`