Guides → Playground & Guide → RAG vs Fine-Tuning - When Each Wins (and Where Break-Even Is)
Meet Maya Iyer. ML Lead at a 80-person FinTech. "Should we RAG our docs for the support chatbot or fine-tune a model on them?"
🔥 Wrong choice = 3 months of wasted engineering. Right choice cuts inference cost 40%.
RAG and fine-tuning solve different problems but get conflated constantly. RAG injects fresh context at query time. Fine-tuning bakes patterns into the model. Teams pick wrong because they optimize for the wrong axis - usually 'easier to ship' over 'right for the workload.'
Maya's chatbot answers questions about her FinTech's product docs. Docs change weekly. Volume: 5K queries/day, mostly recurring patterns. RAG is the obvious starting point - but at 5K queries/day the per-query input cost (system + 4 retrieved docs) is 6× a fine-tuned response with no retrieval.
Break-even is usage volume × document stability. Below 1K queries/day, RAG always wins (FT setup cost dominates). Above 10K queries/day with stable docs, fine-tuning typically wins by 30-60%. In Maya's middle ground (5K queries/day, weekly doc updates), the answer is nuanced - and depends on update frequency more than volume.
This guide walks through the decision honestly: when RAG wins, when fine-tuning wins, and why hybrid approaches (RAG + small fine-tuned classifier) often beat both.
RAG ships fast and adapts to fresh data; fine-tuning is cheaper at scale. Find the break-even - and avoid choosing wrong on a 6-month commitment.
fine_tuning
Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.
Each input shapes your cost. Move the slider — see the impact.
Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.
🚀 Open the full calculator →Read the cost-per-query gap. RAG per-query cost includes retrieved context (4-8K input tokens). Fine-tuned per-query is ~1K input. At identical volumes, FT inference is 4-6× cheaper.
Watch the amortization line. FT has a one-time training cost ($2K-15K depending on model). Divided over 12 months at your volume, when does FT total cost cross under RAG cumulative?
Update frequency is the killer for FT. If docs change weekly and you retrain monthly, your FT model is always 1-4 weeks stale. RAG has no staleness. This is the dominant factor for content-heavy use cases.
Hybrid often wins. Use FT for the routing/classification layer (fast, cheap, narrow). Use RAG for the answer generation (fresh context). Most production AI assistants converge on this pattern by month 6.
Same calculator, three different team sizes. Click a tab to see how the numbers shift.
Low volume, high update frequency, modest specificity. FT setup ($5K) never amortizes. RAG inference cost is small enough that the FT savings can't compete. Don't fine-tune.
Healthy range: RAG clearly wins by 60-80% over 12mo
RAG: ~$2,400/mo. FT amortized: ~$1,800/mo. FT looks cheaper, BUT: weekly doc updates mean retraining monthly = $1K extra. Plus 1-4 week staleness window. Real answer: hybrid (FT for triage, RAG for fresh content).
Healthy range: Either works - staleness drives the call
Legal contract review, medical coding, scientific paper analysis - narrow domain, rare doc updates, high volume. FT inference ~$8K/mo vs RAG ~$15K/mo. Setup amortized in 6 weeks. Plus FT can encode domain conventions RAG can't reliably retrieve.
Healthy range: FT wins by 40-60% - clear call
Cost isn't the only dimension. Click any constraint — see how recommendations change.
RAG cost is dominated by retrieved input tokens. FT cost is dominated by training spend + smaller-model inference. At <1K queries/day, the math always favors RAG. Above 10K, FT typically wins.
Domain-trained models confidently produce wrong answers in their domain. Eval pipeline mandatory.
RAG with citation enforcement is the lowest-hallucination architecture. Fine-tuning improves fluency in domain but doesn't guarantee correctness - and hallucinations sound more authoritative.
Audit trail per query - each retrieved chunk is logged.
Where did training data come from? PII? Proprietary content? Document everything.
FT introduces training data governance burden RAG doesn't have. For regulated industries, RAG's per-query attribution is a compliance feature.
Fine-tuning a vendor's model means your training data is potentially seen by the vendor (depending on tier + contract). Self-hosted FT has the highest privacy posture but adds operational burden.
Voice agents need sub-300ms TTFT - RAG retrieval often blows that. FT or in-prompt context wins for latency-critical use cases.
Fine-tuned models on Vendor X don't migrate to Vendor Y. Switching vendors = retraining from scratch. RAG architectures port across vendors with prompt edits.
FT operational burden is 2-3× RAG. Budget for it.
FT MLOps overhead is the silent cost teams underestimate. Eval pipelines, drift monitoring, retraining triggers, version management. Plan for 1 ML engineer's time at minimum.
Tradeoff analysis is where most AI projects go sideways. Talk to a CFO-grade AI cost analyst →
Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.
Product docs change too often. FT means stale answers. Stick with RAG; invest in caching the system prompt + retrieved chunks.
Healthy range: Use RAG. Fine-tuning's staleness kills it.
ICD-10 codes, legal citations, specialized terminology. FT model encodes formats/conventions reliably. Stable codes = no staleness. Premium tier justified by accuracy needs.
Healthy range: Use fine-tuning. Domain conventions encoded in weights.
Engineering wiki, HR policies - frequent updates, modest specialization. RAG handles freshness. Optional: tiny FT classifier ($500-1K) for query intent routing - cheap, narrow, high-leverage.
Healthy range: RAG primary, optional small FT classifier for routing
High-volume agent that picks among 30 tools. FT a small model ($3K setup) on tool-selection examples. 50K decisions/day at 1/10 the inference cost of RAG. RAG only fires when an answer is needed (10-15% of turns).
Healthy range: FT for tool selection; RAG only when needed
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Embedding Cost for RAG indexing. Fine-Tuning Cost for FT detail. Vector DB Cost for RAG storage.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
View 3-year history for →
Last-verified date is the most recent successful daily snapshot
(aicost_pricing_snapshots) or, when no snapshot exists yet,
the latest successful crawler run (aicost_crawler_runs).
10 of 10
vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.)
are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput |
Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput |
Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput |
Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput |
Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput |
Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput |
Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput |
Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput |
Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput |
Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput |
Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput |
Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput |
Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput |
Extrapolated at 25% of base. |
Pricing is cross-verified against the
LiteLLM community registry
when available. Daily snapshots are kept in aicost_pricing_snapshots;
every change is logged to aicost_price_changelog with old & new
values for full audit trail. Read the full methodology →