Guides → Playground & Guide → RAG Pipeline Cost - Full Stack from Index to Answer
Meet Krishna Iyer. Tech Lead designing a customer-facing knowledge bot. "What does a 1M-doc RAG pipeline actually cost to run end-to-end?"
🔥 Board approved RAG project. CFO wants total cost. I have 5 vendor invoices to combine.
RAG cost has 5 line items, not 1. (1) Embedding indexing - one-time per re-embed. (2) Vector DB storage - monthly. (3) Query embedding - per query. (4) Vector DB query - per query. (5) LLM read with retrieved context - per query, dominates cost. Most teams only model line 5 and miss the rest.
Krishna's 1M doc RAG: 1.5B tokens to embed (one-time, $30 with small model). Vector DB: 1M × 1536 dims = 6GB storage = $50/mo on hosted. 5K queries/day = 150K/mo. Each query: 100-token query embed ($0.0001), 1 vector search ($0.0001 hosted), then LLM read with 6K of retrieved context + 500-token answer (~$0.025 on Sonnet). Per query: $0.025. Monthly: $3,750.
The LLM read dominates at 95%+ of monthly cost. Embedding/retrieval are negligible. Optimization focus: reduce tokens-per-LLM-read (smarter retrieval, better chunking, output compression). Don't over-engineer the cheap parts.
Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.
RAG isn't one cost - it's five. Embedding indexing + storage + query embedding + retrieval + LLM read. Real architecture math for production RAG.
token
Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.
Each input shapes your cost. Move the slider — see the impact.
Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.
🚀 Open the full calculator →5 line items, but LLM dominates. At Krishna's scale: $30 indexing (one-time), $50/mo storage, ~$5/mo query embed + retrieval, ~$3,700/mo LLM read. LLM is 96% of monthly.
Optimization order: reduce LLM cost first. Caching (system prompt + repeated chunks) cuts 20-40%. Smarter retrieval (fewer, more relevant chunks) cuts 15-30%. Tighter outputs cut 10-15%. Stack these for 50%+ savings.
Don't over-optimize embedding/storage. They're already <5% of monthly. Saving 50% of $50 = $25/mo. Eng time better spent on LLM-side optimization.
Watch chunking strategy. Too small = lost context = more retrieved chunks = more cost. Too big = less relevant = lower quality. Sweet spot is workload-dependent.
Same calculator, three different team sizes. Click a tab to see how the numbers shift.
Internal knowledge base, 100K docs, 1K queries/day. pgvector storage marginal cost. Standard 4-chunk retrieval. ~$500/mo.
Healthy range: $300-700/mo
Krishna's customer-facing bot. 1M docs, 5K queries/day, balanced tier. ~$3,750/mo. Add 30% caching savings → $2,650/mo realistic.
Healthy range: $3-4K/mo total
Consumer scale: 200K queries/day. Cheap-tier LLM (Haiku/Flash) with strong retrieval. ~$60K/mo. Without optimization: $250K+. Caching + multi-vendor routing mandatory.
Healthy range: $45-80K/mo with cheap-tier LLM
Cost isn't the only dimension. Click any constraint — see how recommendations change.
RAG cost optimization order: LLM tier > caching > retrieval reduction > vector DB > embeddings. Don't reverse this - it's where your money actually goes.
RAG hallucinations come from bad retrieval (wrong context retrieved) more than bad generation. Invest in retrieval quality + citation enforcement.
RAG's per-query source attribution is a compliance feature - explainable AI by default. Vendor for vector DB matters for regulated content.
RAG creates two privacy surfaces: vector DB (storing embeddings of sensitive content) and LLM (seeing retrieved chunks). Both need attention.
Latency budget: retrieval (100-300ms) + LLM TTFT (200-500ms) = 300-800ms perceived. Streaming starts response as soon as LLM begins. Cache hot retrievals where possible.
RAG lets you swap LLM vendors freely (re-tune prompts only). Vector DB switch is harder - different filter syntax, metadata format, performance characteristics.
RAG MLOps is real: retrieval eval, re-embedding cadence, chunk strategy iteration, citation accuracy monitoring. Budget 0.5-1 ML engineer per production RAG system.
Tradeoff analysis is where most AI projects go sideways. Talk to a CFO-grade AI cost analyst →
Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.
Engineering wiki RAG. Modest scale. pgvector + balanced LLM. ~$220/mo. Don't over-engineer.
Healthy range: $150-350/mo
Premium embed + premium LLM for medical accuracy. Large retrieval (8K tokens for clinical context). Self-hosted vector DB for HIPAA. ~$6.5K/mo.
Healthy range: $5-8K/mo (premium across stack)
Product catalog RAG. Small chunks (product descriptions ~800 tokens). Cheap LLM with structured output. ~$11K/mo at 50K queries/day.
Healthy range: $8-15K/mo at this scale
Code search across 2M files. High retrieval volume + balanced LLM. ~$40K/mo. Self-host vector DB mandatory at this scale.
Healthy range: $30-50K/mo (Cursor-style)
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Embedding Cost for index detail. Vector DB for storage. Hybrid Search for retrieval upgrade.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
View 3-year history for →
Last-verified date is the most recent successful daily snapshot
(aicost_pricing_snapshots) or, when no snapshot exists yet,
the latest successful crawler run (aicost_crawler_runs).
10 of 10
vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.)
are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput |
Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput |
Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput |
Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput |
Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput |
Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput |
Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput |
Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput |
Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput |
Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput |
Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput |
Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput |
Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput |
Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput |
Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput |
Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput |
Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput |
Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput |
Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput |
Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput |
Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput |
Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput |
Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput |
Extrapolated at 25% of base. |
Pricing is cross-verified against the
LiteLLM community registry
when available. Daily snapshots are kept in aicost_pricing_snapshots;
every change is logged to aicost_price_changelog with old & new
values for full audit trail. Read the full methodology →