RAG vs Fine-Tuning · strategic decision tool

Should you fine-tune or build RAG?

The question every AI team asks. Head-to-head cost comparison with realistic workload assumptions.

Pricing verified: 2026-06-05 · CFO + CTO decision tool

What this calculator does

At what query volume does fine-tuning a smaller model beat running RAG on a big one?

Why use it

RAG cost is linear: it scales with query count and context size
Fine-tuning has a fixed training cost plus a lower per-query cost, so the break-even point depends on volume
The quality tradeoff is workload-specific; the calculator models the cost side so you can reason about quality separately

Enter key values

Queries/month, context size, base model (big, for RAG), FT target (smaller, to fine-tune).

Set FT training cost

Rough estimate: $2,000 to $20,000, depending on training data volume. The default reflects a typical project.

Read break-even

Shows the queries/month at which FT becomes cheaper than RAG. Also shows cost at current volume.

Act on it

Below break-even: stick with RAG. Above: FT is cheaper, but evaluate quality carefully before switching.

Enter RAG and FT sides

RAG side: queries/month, retrieved context tokens (top-K × chunk size), and the base model (the big one you're currently using). FT side: target model (a smaller one such as Haiku 4.5, GPT-5-mini, or a vendor-hosted FT-capable model), training dataset size, and training cost estimate. Fine-tuning availability has narrowed in 2026 as several providers wind down their FT platforms, so verify the FT API is still available for your target model before committing to a multi-week training cycle.
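The inputs above can be turned into per-query costs with a few lines of arithmetic. The following is a minimal sketch, not the calculator's actual code: `ModelPrice`, both function names, and every price are hypothetical placeholders, and the sketch ignores embedding and vector-store costs.

```python
from dataclasses import dataclass


@dataclass
class ModelPrice:
    """USD per million tokens. Callers supply the numbers; none are vendor quotes."""
    input_per_mtok: float
    output_per_mtok: float


def rag_cost_per_query(price: ModelPrice, query_tokens: int, top_k: int,
                       chunk_tokens: int, output_tokens: int) -> float:
    """Big model reads the query plus top_k retrieved chunks."""
    context_tokens = top_k * chunk_tokens
    input_tokens = query_tokens + context_tokens
    return (input_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


def ft_cost_per_query(price: ModelPrice, query_tokens: int, output_tokens: int) -> float:
    """Smaller fine-tuned model reads only the query (no retrieved context)."""
    return (query_tokens * price.input_per_mtok
            + output_tokens * price.output_per_mtok) / 1_000_000


# Placeholder prices for illustration only; substitute current vendor rates.
big_model = ModelPrice(input_per_mtok=3.00, output_per_mtok=15.00)
small_model = ModelPrice(input_per_mtok=1.00, output_per_mtok=5.00)

rag = rag_cost_per_query(big_model, query_tokens=200, top_k=5,
                         chunk_tokens=500, output_tokens=300)
ft = ft_cost_per_query(small_model, query_tokens=200, output_tokens=300)
print(f"RAG ${rag:.4f}/query, FT ${ft:.4f}/query")
```

With these placeholder numbers, most of the RAG input bill comes from the 2,500 retrieved context tokens rather than the 200-token query, which is why context size is a first-class input.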

Model the FT economics

Fine-tuning has a one-time training cost (amortized) plus a lower per-query cost (no retrieval, smaller model). RAG has no training cost but a higher per-query cost (big model + context tokens). Break-even queries = training cost / (RAG per-query cost - FT per-query cost); divide by monthly volume to get the break-even time. For example, with a $5K training cost and a per-query saving of roughly $0.08, break-even is about 60K queries: ~6 months at 10K queries/mo, and a couple of days at 1M queries/mo.
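The break-even formula can be sketched directly. This is an illustrative version under stated assumptions: the function names are hypothetical, and the per-query costs ($0.10 for RAG, $0.02 for FT) are placeholders chosen only to land near the "~6 months at 10K queries/mo" example.

```python
from typing import Optional


def break_even_queries(training_cost: float, rag_per_query: float,
                       ft_per_query: float) -> Optional[float]:
    """Queries needed before FT savings repay the training cost.

    Returns None when FT is not cheaper per query, since it then never breaks even.
    """
    saving = rag_per_query - ft_per_query
    if saving <= 0:
        return None
    return training_cost / saving


def break_even_months(training_cost: float, rag_per_query: float,
                      ft_per_query: float, queries_per_month: float) -> Optional[float]:
    """Convert break-even queries into calendar months at a given volume."""
    queries = break_even_queries(training_cost, rag_per_query, ft_per_query)
    return None if queries is None else queries / queries_per_month


# Placeholder per-query costs (USD); not measured values.
low_volume = break_even_months(5_000, 0.10, 0.02, queries_per_month=10_000)
high_volume = break_even_months(5_000, 0.10, 0.02, queries_per_month=1_000_000)
print(f"10K/mo: {low_volume:.2f} months, 1M/mo: {high_volume * 30:.1f} days")
```

Returning `None` rather than a negative number makes the "FT never pays off" case explicit, which matters when the fine-tuned model's hosted inference price is not actually lower.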

Interpret the cost curves

Plot shows cumulative RAG cost (starts at zero, steeper slope) vs cumulative FT cost (y-intercept = training cost, lower slope). Intersection = break-even. Beyond the intersection, FT is cheaper cumulatively; before it, RAG is cheaper. The quality question is separate: FT can be equal, better, or worse in quality depending on the task. Always evaluate on held-out data before switching.
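The two lines on the plot can be reproduced as a small table of cumulative costs. This is a sketch with hypothetical function names and placeholder per-query costs, intended only to show where the curves cross.

```python
from typing import List, Optional, Tuple


def cumulative_costs(months: int, queries_per_month: int, rag_per_query: float,
                     ft_per_query: float, training_cost: float) -> List[Tuple[int, float, float]]:
    """Return (month, cumulative RAG cost, cumulative FT cost) rows."""
    rows = []
    for month in range(1, months + 1):
        queries = month * queries_per_month
        rag_total = queries * rag_per_query                  # line through the origin
        ft_total = training_cost + queries * ft_per_query    # intercept = training cost
        rows.append((month, rag_total, ft_total))
    return rows


def first_month_ft_cheaper(rows: List[Tuple[int, float, float]]) -> Optional[int]:
    """First month in which cumulative FT cost drops below cumulative RAG cost."""
    for month, rag_total, ft_total in rows:
        if ft_total < rag_total:
            return month
    return None


# Placeholder inputs: 10K queries/mo, $0.10 vs $0.02 per query, $5K training.
rows = cumulative_costs(12, 10_000, 0.10, 0.02, 5_000)
for month, rag_total, ft_total in rows:
    print(f"month {month:2d}  RAG ${rag_total:8.0f}  FT ${ft_total:8.0f}")
print("FT cheaper from month:", first_month_ft_cheaper(rows))
```

Because the comparison is made at whole-month boundaries, the reported month is the first full month past the intersection, not the exact crossing point.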

Make the decision

If you're above break-even AND your task has stable patterns (repeated questions, structured outputs), FT is the play. If your corpus changes often, stick with RAG — FT goes stale. Hybrid: use RAG for fresh content, FT for reliable workflows. For the cost side of RAG specifically, see RAG Pipeline Cost. For FT cost details, see Fine-Tuning Cost.
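The decision rule above can be written down as a small function. This is one possible encoding, not the calculator's logic: the function name, the argument names, and the order in which the conditions are checked are all assumptions, and it deliberately leaves quality evaluation out.

```python
def recommend(monthly_queries: float, break_even_monthly_queries: float,
              corpus_changes_often: bool, task_is_stable: bool) -> str:
    """Map the cost and workload signals to 'rag', 'fine-tune', or 'hybrid'."""
    if monthly_queries < break_even_monthly_queries:
        return "rag"            # below break-even: RAG is cheaper
    if corpus_changes_often and task_is_stable:
        return "hybrid"         # RAG for fresh content, FT for the stable workflow
    if corpus_changes_often:
        return "rag"            # a fine-tuned model would go stale
    if task_is_stable:
        return "fine-tune"      # above break-even with repeatable patterns
    return "rag"                # assumption: default to RAG when signals are weak


print(recommend(50_000, 20_000, corpus_changes_often=False, task_is_stable=True))
```

Whatever the function returns, a held-out quality evaluation still has to pass before switching.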

🎛 Your use case

We'll compute both paths with realistic assumptions for each.

Production requests / day

Query tokens / req

Output tokens / req

📚 Your knowledge corpus

Documents

Tokens / doc

How often does corpus change? This is the #1 signal. Frequently-changing content = RAG wins decisively.
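On the RAG side, corpus churn shows up as re-embedding cost. The sketch below assumes that a fixed fraction of the corpus is re-embedded each month; the function names, the `monthly_change_fraction` parameter, and the embedding price are hypothetical placeholders.

```python
def corpus_tokens(documents: int, tokens_per_doc: int) -> int:
    """Total tokens in the knowledge corpus."""
    return documents * tokens_per_doc


def embedding_cost(tokens: float, price_per_mtok: float) -> float:
    """Cost in USD to embed a number of tokens at a per-million-token price."""
    return tokens * price_per_mtok / 1_000_000


def year_one_embedding_cost(documents: int, tokens_per_doc: int,
                            price_per_mtok: float,
                            monthly_change_fraction: float) -> float:
    """Initial full embedding plus 12 months of re-embedding the changed share."""
    initial = embedding_cost(corpus_tokens(documents, tokens_per_doc), price_per_mtok)
    monthly_refresh = initial * monthly_change_fraction
    return initial + 12 * monthly_refresh


# Placeholder inputs: 10,000 docs x 2,000 tokens, $0.10 per million tokens, 10% churn/month.
total = year_one_embedding_cost(10_000, 2_000, price_per_mtok=0.10,
                                monthly_change_fraction=0.10)
print(f"Year-1 embedding cost: ${total:.2f}")
```

Under these placeholder numbers the refresh cost stays small even with steady churn, whereas the fine-tuning path would need a new training run to absorb the same changes.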

⚙ Approach settings

RAG - base LLM for generation

RAG - embedding model

Fine-tune - provider/model

Fine-tune dataset size (tokens): how much labeled training data you'd need to gather and prepare.

📚 RAG approach

Embedding + retrieval + base LLM

year 1 total

🎓 Fine-tuning approach

Custom-trained smaller model

year 1 total

💡 Recommendations

⚖ Side-by-side tradeoffs

Cost is only one dimension. These factor into the real decision.

Dimension	📚 RAG	🎓 Fine-Tuning
Update cycle	Real-time (re-embed new docs)	Hours-to-days (re-train)
Data freshness	✅ Always current	⚠️ Frozen at training time
Citation / provenance	✅ Native (returns source docs)	❌ Black box
Style / tone learning	⚠️ Limited (prompt engineering)	✅ Strong (the main FT strength)
Domain vocabulary	⚠️ Base LLM's knowledge	✅ Learned from examples
Prompt length	Long (query + retrieved context)	Short (model knows context)
Latency	Retrieval + generation (200-500ms extra)	Faster (no retrieval)
Build time	Days (pipeline setup)	Weeks (label data + train + eval)
Hallucinations	✅ Grounded in retrieved docs	⚠️ Can hallucinate learned patterns
Multi-tenant safe	✅ Per-user index isolation	❌ Single model for all

Vendor / Model	Field	Why it’s inferred
Anthropic / Claude Sonnet 4.6	cachedInput	Derived at 10% of input rate; Anthropic publishes a 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5	cachedInput	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5	batchInput	Derived at 50% of standard input; Anthropic documents a uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5	batchOutput	Derived at 50% of standard output; Anthropic documents a uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5	cachedInput	Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini	cachedInput	Derived at 10% of input; OpenAI documents an automatic 90% discount on cache hits across the GPT-5.x tier.
OpenAI / GPT-5.4 Nano	cachedInput	Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano	batchInput	Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano	batchOutput	Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro	cachedInput	Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro	batchInput	Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro	batchOutput	Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2	cachedInput	Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2	batchInput	Derived at 50% of input.
OpenAI / GPT-5.2	batchOutput	Derived at 50% of output.
OpenAI / GPT-5	cachedInput	Derived at 10% of input.
OpenAI / GPT-5	batchInput	Derived at 50% of input.
OpenAI / GPT-5	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.5 Pro	cachedInput	Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models, so the family convention is used.
OpenAI / GPT-5.5 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5.5 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.2 Pro	cachedInput	Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5.2 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.1	batchInput	Derived at 50% of input.
OpenAI / GPT-5.1	batchOutput	Derived at 50% of output.
OpenAI / GPT-5 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5 Nano	cachedInput	Derived at 10% of input.
OpenAI / GPT-5 Nano	batchInput	Derived at 50% of input.
OpenAI / GPT-5 Nano	batchOutput	Derived at 50% of output.
Google / Gemini 3 Flash	cachedInput	Derived at 10% of input; Google caching discount convention of ~90%.
Google / Gemini 3.1 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro	cachedInput	Derived at 10% of input.
Google / Gemini 2.5 Flash	cachedInput	Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash	cachedInput	Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy)	cachedInput	Extrapolated at 25% of base.
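The inferred values listed above follow a simple ratio convention: cached input at 10% of the input rate (25% for the few exceptions noted) and batch rates at 50% of the standard rates. A minimal sketch of that derivation follows; the function name and the example prices are hypothetical, not published rates.

```python
from typing import Dict


def derive_inferred_prices(input_price: float, output_price: float,
                           cache_ratio: float = 0.10,
                           batch_ratio: float = 0.50) -> Dict[str, float]:
    """Derive cachedInput, batchInput, and batchOutput from standard rates.

    cache_ratio defaults to the 10% convention; pass 0.25 for the entries
    that are listed as derived or extrapolated at 25%.
    """
    return {
        "cachedInput": input_price * cache_ratio,
        "batchInput": input_price * batch_ratio,
        "batchOutput": output_price * batch_ratio,
    }


# Placeholder standard rates in USD per million tokens.
default_rates = derive_inferred_prices(input_price=2.00, output_price=10.00)
exception_rates = derive_inferred_prices(input_price=2.00, output_price=10.00,
                                         cache_ratio=0.25)
print(default_rates)
print(exception_rates["cachedInput"])
```

Any value produced this way is an estimate and should be replaced once the vendor publishes the actual rate.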

Methodology

Primary sources

Inferred values (marked with * in calculator tables)